rm(list = ls()) # clear all objects from the workspace
library(tidyverse)
library(fpp3)
library(stargazer)
library(latex2exp)
library(forecast)
library(tidyr)
library(tsibble)
library(magrittr)
library(ggplot2)
library(dplyr)
library(seasonal)
library(reshape2)
library(lubridate)

Introduction

As seen in the news, the effects of the pandemic have been felt through general inflation, higher transportation costs, and food supply shortages, ultimately causing significant food price inflation in the United States. Typical consumers are suffering from food shortages and higher food prices, which made us wonder how the producers of food and agricultural goods have been affected. Farmers in a competitive market have historically struggled and have needed government subsidies. In this report, we analyze the near-term outlook for agricultural producers of three major U.S. exports, soybeans, cotton, and rice, and for one major import, avocados.

Agricultural Crop 1: Soybeans (Chris)

The US is one of the world’s leading soybean producers and exporters. In 2020, the United States exported $25.6 billion of soybeans out of a total world soybean trade worth $64 billion, so the U.S. accounts for roughly 40% of world soybean exports. The only country that exported more soybeans in 2020 than the U.S. was Brazil, at $28.6 billion worth of soybeans exported (Soybeans).

Soybean Production Yield

Import Data for Soybean Yield
# Import data for yield and save as DataYield
df <- read.csv("C:/Users/chris/OneDrive/Desktop/FAOSTAT_data_5-6-2022.csv")

DataYield <- as_tsibble(df, index = Year)
Soybean Yield Plot
# Plot Yield
DataYield %>% autoplot(Value) + labs(title = "Production Yield: Soybeans", y = "hg/ha", x = "Year")

Yield is measured in hg/ha (hectograms per hectare): 1 hg/ha is 100 grams of soybeans per 10,000 square meters. This ratio is used to measure soybean farming efficiency from 1961 to 2020. The plot above shows an increasing trend due to technological innovations in soybean production. In addition to typical farming innovations (tractors, plows, etc.), soybean production efficiency has increased due to the development of disease-resistant soybean seeds. There is no seasonality present because the data is measured on a yearly basis. There appears to be a repeating cycle that ranges in length from 2-5 years but mainly repeats every 3 years. Since there is some cyclicity and no seasonality, forecasting with an ARIMA model makes the most sense, as ARIMA models can capture the cycles in the data.
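As a quick sanity check on the units (a standalone sketch, not part of the analysis pipeline), hg/ha converts to the more familiar tonnes per hectare by dividing by 10,000:

```r
# 1 hg = 100 g = 1e-4 tonne, and the area unit (hectare) is unchanged,
# so converting hg/ha to tonnes/ha is a division by 10,000
hg_ha_to_t_ha <- function(x) x / 1e4

hg_ha_to_t_ha(33785)  # the 2020 value of 33,785 hg/ha is about 3.4 t/ha
```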

Transform the Data to Make it Stationary
# Do log transformation and see if it needs to be seasonally differenced or differenced
DataYield1 <- DataYield %>% mutate(log_Yield = log(Value)) %>% features(log_Yield, unitroot_nsdiffs)

DataYield1
# A tibble: 1 x 1
  nsdiffs
    <int>
1       0
DataYield2 <- DataYield %>% mutate(log_Yield = log(Value)) %>% features(log_Yield, unitroot_ndiffs)

DataYield2
# A tibble: 1 x 1
  ndiffs
   <int>
1      1

After using a log transformation, the data on soybean yield needs to be differenced once for the data to be stationary.

Visually Check that the Data is Stationary
# Check if the data is stationary and analyze ACF and PACF
DataYield %>% gg_tsdisplay(log(Value) %>% difference(), plot_type = 'partial')

Visually, the data looks stationary. The series is roughly horizontal, has no predictable patterns, and is very close to having constant variance. The ACF also drops off sharply after the first lag. Based on the ACF and PACF plots, I would guess an ARIMA(0,1,1), ARIMA(1,1,0), ARIMA(1,1,1), or ARIMA(3,1,0). I let R pick the best model within pdq(0:3, 1, 0:1), within pdq(0:1, 1, 0:1), and automatically without restrictions.

Fit ARIMA MODELS
# Fit ARIMA models
fitARIMA <- DataYield %>%
model(
auto = ARIMA(log(Value))
)

report(fitARIMA)
Series: Value 
Model: ARIMA(4,1,0) w/ drift 
Transformation: log(Value) 

Coefficients:
          ar1      ar2      ar3      ar4  constant
      -0.8038  -0.5811  -0.5001  -0.2237    0.0393
s.e.   0.1271   0.1528   0.1512   0.1259    0.0101

sigma^2 estimated as 0.00624:  log likelihood=68.16
AIC=-124.33   AICc=-122.71   BIC=-111.86
fitguess <- DataYield %>%
model(
guess = ARIMA(log(Value) ~ pdq(0:3, 1, 0:1))
)

report(fitguess)
Series: Value 
Model: ARIMA(3,1,0) w/ drift 
Transformation: log(Value) 

Coefficients:
          ar1      ar2      ar3  constant
      -0.7245  -0.4665  -0.3310    0.0312
s.e.   0.1224   0.1417   0.1213    0.0103

sigma^2 estimated as 0.006476:  log likelihood=66.64
AIC=-123.28   AICc=-122.15   BIC=-112.89
fitguess2 <- DataYield %>%
model(
guess = ARIMA(log(Value) ~ pdq(0:1, 1, 0:1))
)

report(fitguess2)
Series: Value 
Model: ARIMA(0,1,1) 
Transformation: log(Value) 

Coefficients:
          ma1
      -0.5755
s.e.   0.0882

sigma^2 estimated as 0.007228:  log likelihood=62.02
AIC=-120.04   AICc=-119.83   BIC=-115.88

The model with the best fit according to AICc is the automatically selected model: ARIMA(4,1,0) w/ drift.

Check the Residuals of ARIMA(4,1,0) w/ drift
# Check the residuals of the model with the lowest AICc
augment(fitARIMA) %>% gg_tsdisplay(.resid, plot_type = "hist")

The residuals look like white noise. There are no significant lags in the acf, and the residuals are centered around zero.

# Box-Ljung test
Box.test(augment(fitARIMA)$.resid, lag=10,type="Lj")

    Box-Ljung test

data:  augment(fitARIMA)$.resid
X-squared = 12.222, df = 10, p-value = 0.2704

The Box-Ljung test returns a p-value of 0.2704, so we fail to reject the null hypothesis that the residuals are white noise. This further confirms that the residuals resemble white noise.
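As a sanity check on how to read this test (an illustrative aside, not part of the soybean analysis), applying the same test to simulated white noise should typically yield a large p-value:

```r
# On pure white noise the Ljung-Box statistic should be unremarkable,
# so the p-value is usually well above conventional significance levels
set.seed(123)
wn <- rnorm(60)
Box.test(wn, lag = 10, type = "Ljung-Box")
```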

Confirming with Training Data that the Forecasting Model Chosen is the Correct Model
# Create Training Data
DataYieldTrain <- DataYield %>%
  filter_index(. ~ "2009")
# Check Arima model AICc with Training Data
fitARIMAtrain <- DataYieldTrain %>%
model(
auto = ARIMA(log(Value))
)

report(fitARIMAtrain)
Series: Value 
Model: ARIMA(0,1,1) 
Transformation: log(Value) 

Coefficients:
          ma1
      -0.6163
s.e.   0.0898

sigma^2 estimated as 0.007885:  log likelihood=48.39
AIC=-92.78   AICc=-92.51   BIC=-89.04
fitguesstrain <- DataYieldTrain %>%
model(
guess = ARIMA(log(Value) ~ pdq(0:3, 1, 0:1))
)

report(fitguesstrain)
fitguess2train <- DataYieldTrain %>%
model(
guess2 = ARIMA(log(Value) ~ pdq(0:1, 1, 0:1))
)

report(fitguess2train)
# Accuracy Testing
fcARIMAtrain <- fitARIMAtrain %>% forecast(h = 10)

accuracy(fcARIMAtrain, DataYield)
# A tibble: 1 x 10
  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 auto   Test  2799. 3718. 3126.  8.38  9.60  1.78  1.60 0.704
fcguesstrain <- fitguesstrain %>% forecast(h = 10)

accuracy(fcguesstrain, DataYield)
# A tibble: 1 x 10
  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 guess  Test  2799. 3718. 3126.  8.38  9.60  1.78  1.60 0.704
fcguesstrain2 <- fitguess2train %>% forecast(h = 10)

accuracy(fcguesstrain2, DataYield)
# A tibble: 1 x 10
  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 guess2 Test  2799. 3718. 3126.  8.38  9.60  1.78  1.60 0.704

The training data contains observations up to 2009. When forecasting 10 years ahead from the training data and comparing the results to the test data, the RMSE values are the same for all three models. Since RMSE does not distinguish them, I will stick with ARIMA(4,1,0) w/ drift because it has the lowest AICc on the full dataset, meaning it has the best fit for forecasting soybean yield. In addition, the MAPE of 9.60% suggests the forecasts are reasonably accurate.
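For reference, the MAPE reported by `accuracy()` is the mean absolute percentage error. A minimal base-R sketch of the formula (with made-up numbers, not the soybean data):

```r
# Mean absolute percentage error: average of |error| / |actual|, in percent
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

mape(c(100, 200), c(110, 180))  # 10% error on each point, so MAPE is 10
```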

Forecast Yield Data up to 2024
# Forecast for the next 4 years
FCDataYield <- DataYield %>%
  model(auto = ARIMA(log(Value))) %>%
  forecast(h = 4)

FCDataYield %>%
  autoplot() + labs(title = "Production Yield Forecast: Soybeans", y = "hg/ha")

FCDataYield %>%
  autoplot(DataYield) + labs(title = "Production Yield Forecast: Soybeans", y = "hg/ha")

Forecast results:
  • 2020: 33,785
  • 2021: 34,885.58
  • 2022: 35,194.18
  • 2023: 35,229.86
  • 2024: 35,449.17

The soybean yield forecast predicts an increase in soybean yield from 2020 to 2024. Soybean yield is predicted to reach the highest value in U.S. history in 2022 and set record highs for 2023 and 2024 as well.

Policy Implications

Congress can see that soybean production efficiency will be at an all-time high in 2022 and will continue to increase. This means that the subsidization of soybeans will offer higher returns than previous years. If Congress wants to further increase efficiency, soybean producers could be subsidized. This subsidization would allow soybean producers to improve their technology in order to increase farming efficiency. In addition, policymakers can provide financial incentives to third parties to research and improve upon existing technology.

Soybean Production Quantity

Import the Data for Soybean Production Quantity
# Import the Data
SoybeanProduction <- read.csv("C:/Users/chris/OneDrive/Desktop/SoybeanProduction.csv")
# Convert from a tibble to a tsibble
DataProd <- SoybeanProduction %>%
  mutate(Year = ï..Year) %>%  # the CSV's byte-order mark mangled the first column name
  as_tsibble(index = Year)
Soybean Production Quantity Plot
# Plot using autoplot
DataProd %>% autoplot(Production) + labs(title = "Production Quantity: Soybeans", y = "tonnes", x = "Year")

Soybean production quantity is measured in tonnes from 1961 to 2020. The goal of using production quantity data is to predict expected U.S. soybean exports in the upcoming years. The plot above shows an increasing trend in production quantity. There is no seasonality present because the data is measured yearly. Similar to soybean yield, there appears to be a somewhat inconsistent cycle that repeats roughly every three years; it occasionally repeats after only two years and can take up to five years to repeat. To best capture this potential cycle, an ARIMA model will be used to forecast production quantity.

Transform the Data to Make it Stationary
# Check for the need of seasonal differencing and differencing
DataProd1 <- DataProd %>% mutate(log_Prod = log(Production)) %>% features(log_Prod, unitroot_nsdiffs)

DataProd1
# A tibble: 1 x 1
  nsdiffs
    <int>
1       0
DataProd2 <- DataProd %>% mutate(log_Prod = log(Production)) %>% features(log_Prod, unitroot_ndiffs)

DataProd2
# A tibble: 1 x 1
  ndiffs
   <int>
1      1

After a log transformation to account for the increasing trend, the data only needs to be differenced once.

Visually Check that the Data is Stationary
# Check if the data is stationary and analyze ACF and PACF
DataProd %>% autoplot(log(Production) %>% difference())

DataProd %>% gg_tsdisplay(log(Production) %>% difference(), plot_type = 'partial')

Visually, the data looks stationary. The series is roughly horizontal and has no predictable patterns. The variance is constant from 1972 to 2020, and overall the full series is very close to having constant variance. The ACF also drops off sharply after the first lag. The significant ACF spike at lag 12 raises some concern, but overall the data looks stationary. Based on the ACF and PACF plots, I would guess an ARIMA(0,1,1), ARIMA(1,1,0), or ARIMA(1,1,1). I let R pick the best model within pdq(0:1, 1, 0:1) and automatically without restrictions.

Fit ARIMA MODELS
# Fit ARIMA models and report AICc
fitARIMAauto <- DataProd %>%
model(
auto = ARIMA(log(Production))
)

report(fitARIMAauto)
Series: Production 
Model: ARIMA(0,1,1) w/ drift 
Transformation: log(Production) 

Coefficients:
          ma1  constant
      -0.5209    0.0309
s.e.   0.1045    0.0074

sigma^2 estimated as 0.01409:  log likelihood=42.89
AIC=-79.79   AICc=-79.35   BIC=-73.55
fitARIMAguess <- DataProd %>%
model(
guess = ARIMA(log(Production) ~ pdq(0:1, 1, 0:1))
)

report(fitARIMAguess)
Series: Production 
Model: ARIMA(0,1,1) w/ drift 
Transformation: log(Production) 

Coefficients:
          ma1  constant
      -0.5209    0.0309
s.e.   0.1045    0.0074

sigma^2 estimated as 0.01409:  log likelihood=42.89
AIC=-79.79   AICc=-79.35   BIC=-73.55

The best fitted model in both cases is ARIMA(0,1,1) with drift.

Check the Residuals of ARIMA(0,1,1) w/ Drift
# Plot the residuals
augment(fitARIMAauto) %>% gg_tsdisplay(.resid, plot_type = "hist")

#Box-Ljung test
Box.test(augment(fitARIMAauto)$.resid, lag=10,type="Lj")

    Box-Ljung test

data:  augment(fitARIMAauto)$.resid
X-squared = 4.8454, df = 10, p-value = 0.9013

There is only one significant lag in the ACF, at lag 12, and it is only marginally significant, so it is not of concern. The residuals are centered around zero. Other than the low variance of the residuals until 1972 and the drop in 2019, the residuals look like white noise. The Box-Ljung test gives a p-value of 0.9013, so we fail to reject the null hypothesis: the residuals resemble white noise.

Confirming with Training Data that the Forecasting Model Chosen is an Accurate Model
# Create training data
DataProdTrain <- DataProd %>%
  filter_index(. ~ "2009")
# Fit the training model
fitARIMAtrain <- DataProdTrain %>%
model(
auto = ARIMA(log(Production))
)
# Test the training model
fcARIMAtrain <- fitARIMAtrain %>% forecast(h = 10)

accuracy(fcARIMAtrain, DataProd)
# A tibble: 1 x 10
  .model .type        ME      RMSE      MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>     <dbl>     <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 auto   Test  -5836137. 11367259. 7702027. -6.67  8.28  1.29  1.49 0.125

Since the best ARIMA model in both cases above was ARIMA(0,1,1) with drift, the MAPE is used to check forecast accuracy. The MAPE of 8.28% suggests the forecasts are reasonably accurate.

Forecast Production Data up to 2024
FCDataProd <- DataProd %>%
  model(ARIMAauto = ARIMA(log(Production))) %>%
  forecast(h = 4)

FCDataProd %>%
  autoplot() + labs(title = "Production Quantity Forecast: Soybeans", y = "Tonnes")

FCDataProd %>%
  autoplot(DataProd) + labs(title = "Production Quantity Forecast: Soybeans", y = "Tonnes")

Forecast results:
  • 2020: 112,549,240
  • 2021: 117,624,286
  • 2022: 121,508,431
  • 2023: 125,520,515
  • 2024: 129,664,741

The soybean production quantity forecast predicts an increase in production quantity from 2020 to 2024, including a predicted 3.3% increase from 2021 to 2022. Production quantity is predicted to reach the highest value in U.S. history in 2022 and to set record highs again in 2023 and 2024. Record production levels are intuitively consistent with the yield forecast, since farming efficiency is predicted to be at all-time highs during the same period.
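The 3.3% figure follows directly from the point forecasts listed above:

```r
# Percent change between the 2021 and 2022 point forecasts
fc_2021 <- 117624286
fc_2022 <- 121508431
round(100 * (fc_2022 - fc_2021) / fc_2021, 1)  # 3.3
```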

Policy Implications

Congress can see that soybean production quantity will be at an all-time high in 2022 and will continue to increase. If Congress wants to export a higher quantity of soybeans than the forecasted values above, soybeans can be subsidized appropriately to hit the desired export quantity. Subsidizing soybean producers will increase production quantity in two ways. First, existing producers can afford to grow more soybeans, increasing their total production. Second, increased subsidization will incentivize new producers to join the soybean export market, further increasing total production quantity.

Soybean Producer Price Index

Import the Data for Soybean Producer Price Index
# Load in FRED data
library(readr)
PPI_soybean <- read_csv("C:/Users/chris/OneDrive/Desktop/WPU01830131.csv")

# Build a monthly tsibble with PPI plus MoM and YoY changes
PPI_Final_TS <- PPI_soybean %>%
  mutate(YearMonth = yearmonth(DATE),
         PPI = as.numeric(WPU01830131)) %>%
  select(YearMonth, PPI) %>%
  as_tsibble(index = YearMonth) %>%
  # Keep 1999 so the 12-month lag is defined from January 2000 onward
  filter_index("1999 Jan" ~ .) %>%
  mutate(
    MoM = (PPI - lag(PPI)) / lag(PPI),
    YoY = (PPI - lag(PPI, 12)) / lag(PPI, 12)
  ) %>%
  # Drop the lag-induced NAs by starting the sample in January 2000
  filter_index("2000 Jan" ~ .)
Soybean PPI Plot
# Plot PPI
PPI_Final_TS %>% ggplot(aes(x = YearMonth, y = PPI)) +
  geom_line() + labs(title = "Producer Price Index: Soybeans", y = "PPI")

PPI_Final_TS %>% gg_season(PPI) + labs(title = "Season Plot: Soybeans", y = "PPI")

The Producer Price Index is measured with 1982 as the base year and captures the prices received by domestic producers of goods and services. A lower PPI signals that producers will receive less income relative to prior history, implying more need for subsidization; a higher PPI signals the opposite and less need for subsidization. The plot above shows a generally increasing trend. There is a cycle of highs followed by lows that can be explained by changes in demand, since in a competitive market the price producers receive must clear against demand. PPI peaks roughly every 4-5 years, with the last peak delayed by the COVID-19 pandemic. The data is volatile, with high persistence around the cycle peaks. There is some seasonality, as shown in the seasonal plot, but no pronounced season. An ARIMA model will be used to capture the cyclicity of this data.

Transform the Data to Make it Stationary
# Check to see if the data needs to be seasonally differenced
PPI_FInal_TS1 <- PPI_Final_TS %>% mutate(log_PPI = log(PPI)) %>% features(log_PPI, unitroot_nsdiffs)

PPI_FInal_TS1
# A tibble: 1 x 1
  nsdiffs
    <int>
1       0
# Check to see if the data needs to be differenced
PPI_FInal_TS2 <- PPI_Final_TS %>% mutate(log_PPI = log(PPI)) %>% features(log_PPI, unitroot_ndiffs)

PPI_FInal_TS2
# A tibble: 1 x 1
  ndiffs
   <int>
1      1

After a log transformation, the data needs to be differenced once to become stationary.

Visually Check that the Data is Stationary
PPI_Final_TS %>% autoplot(log(PPI) %>% difference())

PPI_Final_TS %>% gg_tsdisplay(log(PPI) %>% difference(), plot_type = 'partial')

Visually, the data looks stationary. The series is roughly horizontal, has no predictable patterns, and is about as close as possible to constant variance. The ACF also drops off sharply after the second lag. Based on the ACF and PACF plots, I would guess an ARIMA(0,1,2), ARIMA(2,1,0), or ARIMA(3,1,0). I let R pick the best model within pdq(0:3, 1, 0:2) and pdq(2:3, 1, 0:2), as well as automatically without restrictions.

Fit ARIMA MODELS
# Fit ARIMA models
fitARIMAauto <- PPI_Final_TS %>%
model(
auto = ARIMA(log(PPI))
)

report(fitARIMAauto)
Series: PPI 
Model: ARIMA(0,1,2) 
Transformation: log(PPI) 

Coefficients:
         ma1     ma2
      0.0581  0.2351
s.e.  0.0604  0.0571

sigma^2 estimated as 0.00567:  log likelihood=311.47
AIC=-616.93   AICc=-616.84   BIC=-606.18
fitARIMAguess <- PPI_Final_TS %>%
model(
guess = ARIMA(log(PPI) ~ pdq(0:3, 1, 0:2))
)

report(fitARIMAguess)
Series: PPI 
Model: ARIMA(0,1,2) 
Transformation: log(PPI) 

Coefficients:
         ma1     ma2
      0.0581  0.2351
s.e.  0.0604  0.0571

sigma^2 estimated as 0.00567:  log likelihood=311.47
AIC=-616.93   AICc=-616.84   BIC=-606.18
fitARIMAguess2 <- PPI_Final_TS %>%
model(
guess2 = ARIMA(log(PPI) ~ pdq(2:3, 1, 0:2))
)

report(fitARIMAguess2)
Series: PPI 
Model: ARIMA(3,1,1) 
Transformation: log(PPI) 

Coefficients:
         ar1     ar2      ar3      ma1
      0.7868  0.2096  -0.2825  -0.7609
s.e.  0.1320  0.0748   0.0589   0.1280

sigma^2 estimated as 0.005558:  log likelihood=315.06
AIC=-620.12   AICc=-619.89   BIC=-602.2

The ARIMA model that indicates the best fit with the lowest AICc is ARIMA(3,1,1).

Check the Residuals of ARIMA(3,1,1)
# Plot the residuals
augment(fitARIMAguess2) %>% gg_tsdisplay(.resid, plot_type = "hist")

# Box-Ljung test
Box.test(augment(fitARIMAguess2)$.resid, lag=24,type="Lj")

    Box-Ljung test

data:  augment(fitARIMAguess2)$.resid
X-squared = 21.366, df = 24, p-value = 0.617

There is only one significant lag in the ACF, at lag 11. The residuals are centered around zero and generally look like white noise. The Box-Ljung test gives a p-value of 0.617, so I fail to reject the null hypothesis: the residuals are indistinguishable from white noise.

Confirming with Training Data that the Forecasting Model Chosen is the Correct Model
# Create training data
PPI_train <- PPI_Final_TS %>%
  filter_index(. ~ "2016 Dec")
# Fit training models
fitARIMAautotrain <- PPI_train %>%
model(
auto = ARIMA(log(PPI))
)

report(fitARIMAautotrain)
Series: PPI 
Model: ARIMA(0,1,0) 
Transformation: log(PPI) 

sigma^2 estimated as 0.006588:  log likelihood=221.75
AIC=-441.5   AICc=-441.48   BIC=-438.18
fitARIMAguesstrain <- PPI_train %>%
model(
guess = ARIMA(log(PPI) ~ pdq(0:3, 1, 0:2))
)

report(fitARIMAguesstrain)
Series: PPI 
Model: ARIMA(0,1,0) 
Transformation: log(PPI) 

sigma^2 estimated as 0.006588:  log likelihood=221.75
AIC=-441.5   AICc=-441.48   BIC=-438.18
fitARIMAguess2train <- PPI_train %>%
model(
guess2 = ARIMA(log(PPI) ~ pdq(2:3, 1, 0:2))
)

report(fitARIMAguess2train)
Series: PPI 
Model: ARIMA(3,1,1) 
Transformation: log(PPI) 

Coefficients:
         ar1     ar2      ar3      ma1
      0.7881  0.1846  -0.2672  -0.7628
s.e.  0.1659  0.0861   0.0673   0.1643

sigma^2 estimated as 0.006222:  log likelihood=229.44
AIC=-448.89   AICc=-448.58   BIC=-432.32
fcARIMAautotrain <- fitARIMAautotrain %>% forecast(h = 48)

accuracy(fcARIMAautotrain, PPI_Final_TS)
# A tibble: 1 x 10
  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 auto   Test  -29.1  33.7  29.4 -19.7  19.9 0.892 0.794 0.819
fcARIMAtrain <- fitARIMAguesstrain %>% forecast(h = 48)

accuracy(fcARIMAtrain, PPI_Final_TS)
# A tibble: 1 x 10
  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 guess  Test  -29.1  33.7  29.4 -19.7  19.9 0.892 0.794 0.819
fcARIMAtrain2 <- fitARIMAguess2train %>% forecast(h = 48)

accuracy(fcARIMAtrain2, PPI_Final_TS)
# A tibble: 1 x 10
  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 guess2 Test  -27.2  31.2  27.9 -18.5  18.8 0.846 0.737 0.774
fcARIMAtrain3 <- fitARIMAguess2train %>% forecast(h = 24)

accuracy(fcARIMAtrain3, PPI_Final_TS)
# A tibble: 1 x 10
  .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 guess2 Test  -20.8  24.6  20.8 -13.9  13.9 0.631 0.581 0.756

ARIMA(3,1,1) has the best fit in the training data according to AICc and produces the lowest RMSE of the three models. When forecasting the next two years, the MAPE is 13.9%, which suggests the model forecasts the data reasonably well over that horizon.

Forecast PPI Data for the next two years
FCPPI_Final_TS <- PPI_Final_TS %>%
  model(ARIMAguess = ARIMA(log(PPI) ~ pdq(2:3, 1, 0:2))) %>%
  forecast(h = 24)

FCPPI_Final_TS %>%
  autoplot() + labs(title = "PPI Forecast", y = "PPI")

FCPPI_Final_TS %>%
  autoplot(PPI_Final_TS) + labs(title = "PPI Forecast", y = "PPI")

Forecast results:
  • March 2022: 279.991
  • April 2022: 286.0398
  • May 2022: 286.3189
  • June 2022: 282.8930
  • July 2022: 278.7200
  • August 2022: 274.8387
  • September 2022: 272.0438
  • October 2022: 270.3512
  • November 2022: 269.6702
  • December 2022: 269.7107
  • January 2023: 270.2241
  • February 2023: 270.9767
  • March 2023: 271.8131
  • March 2024: 278.7061

The updated PPI forecast predicts that PPI will increase until May 2022, decrease until November 2022, and then slowly increase until March 2024. Even though PPI is forecast to dip to 269.67, that level still represents very high producer prices compared to previous years. The highest PPI ever recorded was 289.8, so PPI in the near future will remain close to record levels.
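To put the dip in perspective, simple arithmetic on the figures above shows how close the forecast trough stays to the record high:

```r
# The forecast trough relative to the record high PPI
record <- 289.8
trough <- 269.67
round(100 * (record - trough) / record, 1)  # the trough is only ~6.9% below the record
```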

Policy Implications

Congress can see that soybean producers are going to receive slightly less for their soybean harvests over the rest of 2022. This may indicate the need for a small subsidy to help soybean producers get through the year. However, prices are still very high relative to 2015 through 2019: the forecast values are roughly 100 index points higher than the values from that period. This may signify a slightly lower dependence on subsidization for soybean producers in the future.

Extra - Soybean Producer Price Index (YoY)

Soybean PPI Plot
# Plot PPI
PPI_Final_TS %>% ggplot(aes(x = YearMonth, y = YoY)) +
  geom_line() + labs(title = "Producer Price Index: Soybeans (YoY)", y = "YoY Inflation")

PPI_Final_TS %>% gg_season(YoY) + labs(title = "Season Plot: Soybeans", y = "YoY Inflation")

The Producer Price Index is measured with 1982 as the base year and captures the prices received by domestic producers of goods and services. The data used in this section is mutated to capture year-on-year producer price inflation from January 2000 to March 2022. This change allows PPI to be analyzed as inflation relative to the same month of the prior year. A negative year-on-year value signals that producers will receive less income than in the prior year, implying more need for subsidization; a positive value signals the opposite and less need for subsidization.
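The year-on-year transformation is simply the 12-month lagged percent change. A minimal base-R sketch of the formula (with made-up PPI values, not the soybean series):

```r
# YoY inflation: (x_t - x_{t-12}) / x_{t-12}
yoy <- function(x, k = 12) {
  lagged <- c(rep(NA_real_, k), x[seq_len(length(x) - k)])
  (x - lagged) / lagged
}

ppi <- c(rep(100, 12), rep(110, 12))  # hypothetical series: a one-time 10% jump
yoy(ppi)[13]  # 0.1
```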

In the plot above, year-on-year inflation is volatile, with occasional periods of high persistence. The general trend is flat: the series is neither increasing nor decreasing. There is some seasonality, as shown in the seasonal plot, but no pronounced season. There is a cycle in which lows are followed by highs before returning to lows; due to the volatility of the data, this cycle is inconsistent. An ARIMA model will be used to capture this cycle.

Transform the Data to Make it Stationary
# Check for the need to be seasonally differenced
PPI_FInal_TS1 <- PPI_Final_TS %>% features(YoY, unitroot_nsdiffs)

PPI_FInal_TS1
# A tibble: 1 x 1
  nsdiffs
    <int>
1       0
# Check for the need to be differenced
PPI_FInal_TS1 <- PPI_Final_TS %>% features(YoY, unitroot_ndiffs)

PPI_FInal_TS1
# A tibble: 1 x 1
  ndiffs
   <int>
1      0

The data does not need to be differenced or transformed to be stationary.

Visually Check that the Data is Stationary
# Plot to check that the data is stationary (no differencing needed per the tests above)
PPI_Final_TS %>% gg_tsdisplay(YoY, plot_type = 'partial')

Other than the significant spikes at lags 12 and 24, the data looks stationary. For the sake of time, I will proceed with just the automatically selected ARIMA in this section. Based on the ACF and PACF plots, the auto ARIMA model should include a seasonal component.

Fit ARIMA Model with Auto
fitARIMAauto <- PPI_Final_TS %>%
model(
auto = ARIMA(YoY)
)

report(fitARIMAauto)
Series: YoY 
Model: ARIMA(3,0,0)(0,0,1)[12] w/ mean 

Coefficients:
         ar1     ar2      ar3     sma1  constant
      0.9556  0.2276  -0.2396  -0.7038    0.0044
s.e.  0.0593  0.0823   0.0599   0.0547    0.0017

sigma^2 estimated as 0.008216:  log likelihood=259.58
AIC=-507.17   AICc=-506.85   BIC=-485.65

Check the Residuals of ARIMA(3,0,0)(0,0,1)[12] w/ mean
# Plot the residuals
augment(fitARIMAauto) %>% gg_tsdisplay(.resid, plot_type = "hist")

# Box-Ljung test
Box.test(augment(fitARIMAauto)$.resid, lag=24,type="Lj")

    Box-Ljung test

data:  augment(fitARIMAauto)$.resid
X-squared = 17.608, df = 24, p-value = 0.8216

The residuals visually resemble white noise: they have constant variance, are centered around zero, and show no significant lags in the ACF plot. The Box-Ljung test confirms that the residuals are indistinguishable from white noise, with a p-value of 0.8216.

Forecast YoY Inflation for the Next Two Years
FCPPI_Final_TS2 <- PPI_Final_TS %>%
  model(auto = ARIMA(YoY)) %>%
  forecast(h = 24)

FCPPI_Final_TS2 %>%
  autoplot() + labs(title = "PPI Forecast (YoY)", y = "YoY Inflation")

FCPPI_Final_TS2 %>%
  autoplot(PPI_Final_TS) + labs(title = "PPI Forecast (YoY)", y = "YoY Inflation")

Forecast results:
  • March 2022: 0.127
  • April 2022: 0.154
  • May 2022: -0.070
  • June 2022: 0.102
  • July 2022: 0.131
  • August 2022: 0.129
  • September 2022: 0.234
  • October 2022: 0.307
  • November 2022: 0.311
  • December 2022: 0.273
  • January 2023: 0.197
  • February 2023: 0.088
  • March 2023: 0.0167
  • March 2024: 0.0255

For the rest of 2022, the price soybean producers receive is predicted to be higher than in the prior year, except for the month of May. From April 2023 until October 2023, price inflation is predicted to be negative; from November 2023 until March 2024 it is predicted to be positive.

Policy Implications

The results here suggest that Congress does not need to institute an additional subsidy to help soybean producers get through the year, because producers are predicted to generally receive more in 2022 than in the same months of 2021. Congress should not worry about the negative values from April 2023 until October 2023, because they are small relative to the increases experienced in 2022, as seen in the plot above. These results confirm the prior section’s general implication that there may be slightly less dependence on subsidies in the future.

Agricultural Crop 2: Cotton (Francesca)

cotton_yield <- read_csv("cotton_yield.csv")
cotton_prod <- read_csv("cotton_prod.csv")
ppi_cotton <- read_csv("ppi_cotton.csv")
cotton_yield_ts <- cotton_yield  %>%
  as_tsibble(index = Year)
cotton_prod_ts <- cotton_prod %>%
  as_tsibble(index = Year)
ppi_cotton <- ppi_cotton %>%
  mutate(Value = WPU015101)
ppi_cotton_ts <- ppi_cotton %>%
  mutate(month = yearmonth(DATE)) %>%
  as_tsibble(index = month)
ppi_cotton_ts <- ppi_cotton_ts %>%
  mutate(inflation = (Value/lag(Value)-1)*100) %>%
  mutate(inflation = ifelse(is.na(inflation),0,inflation))

Production

About the Data

Looking at the plot of United States cotton production, we see a steep upward trend from 1866 until the early 1900s, and a continued upward trend with a lot of volatility thereafter. The United States exports more cotton than any other country, so it makes sense that we produce a lot of it; however, we rank third in total production behind India and China. Cotton production has increased due to higher yield (more on that later), population growth, and economic growth.

autoplot(cotton_prod_ts) + labs(title = "United States Cotton Production",x="Year",y="Bales")

United States Cotton Production from 2000 to present

We can see from this chart that cotton production has been volatile over the past twenty years, with no clear upward or downward trend. Production rose 21% from 2020 to 2021, but fell 27% from 2019 to 2020.
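The year-over-year changes quoted above can be reproduced with base R; the figures below are hypothetical stand-ins chosen only to illustrate the calculation, not the actual production numbers:

```r
# Hypothetical annual production figures, in bales (illustrative only)
bales <- c(`2019` = 19.9e6, `2020` = 14.6e6, `2021` = 17.6e6)

# Year-over-year percent change: (current / previous - 1) * 100
yoy <- (bales[-1] / bales[-length(bales)] - 1) * 100
round(yoy)
```

With these stand-in values, the 2020 change rounds to -27% and the 2021 change to +21%.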

cotton_prod_ts_00 <- cotton_prod_ts %>%
  filter(Year>=2000)
autoplot(cotton_prod_ts_00) + labs(x="Year",y="Bales",title = "United States Cotton Production: 2000 to Present")

Transforming the data

I tried a Box-Cox transformation and a logarithmic transformation of the data. The remainder component of the STL decomposition is smallest for the logarithmic transformation, so I will be using that going forward.

lambda_cottonprod <- cotton_prod_ts %>%
  features(Value, features = guerrero) %>%
  pull(lambda_guerrero)
cotton_prod_ts %>%
  autoplot(box_cox(Value, lambda_cottonprod))

cotton_prod_ts %>%
  model(STL(box_cox(Value, lambda_cottonprod))) %>%
  components() %>% autoplot()

cotton_prod_ts %>%
  model(STL(log(Value))) %>%
  components() %>% autoplot()

cotton_prod_ts %>%
  model(STL(Value)) %>%
  components() %>% autoplot()

cotton_prod_ts <- cotton_prod_ts %>%
  mutate(log_value = log(Value))
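The transformation comparison described above can be made concrete by extracting the STL remainder and measuring its spread under each candidate transformation. A sketch on a synthetic stand-in series (the toy data and its parameters are invented for illustration; run the same pipeline on cotton_prod_ts for the real comparison):

```r
library(fpp3)

# Synthetic annual series standing in for the production data (illustrative)
set.seed(42)
toy <- tsibble(Year = 1950:2020,
               Value = exp(0.03 * 0:70 + rnorm(71, 0, 0.2)) * 1e6,
               index = Year)

# Variance of the STL remainder on the log scale; repeat with box_cox()
# and the untransformed series, then compare
log_remainder <- toy %>%
  model(STL(log(Value))) %>%
  components() %>%
  pull(remainder)
var(log_remainder)
```

Note that each remainder lives on its own transformed scale, so comparisons across transformations are only rough.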
Plotting the transformed values

After plotting the logarithmically transformed production data, we see the same steep upward trend from the late 1800s to the early 1900s and then the volatility of production until 2021.

cotton_prod_ts %>%
  autoplot(log(Value)) + labs(title="United States Cotton Production",x="Year",y="Bales (Logarithmically Transformed)")

Basic forecasting

Using the basic naive, mean, and drift forecasting models, we see that cotton production is forecasted to increase over the next 10 years according to the drift and naive models. According to the mean model, cotton production will stay around 1.25 million bales.

cotton_prod_ts_fit <- cotton_prod_ts %>%
  model(
        `Naive` = NAIVE(log(Value)),
          Mean = MEAN(log(Value)),
          Drift = RW(log(Value) ~ drift()))
# Generate forecasts for 10 years
cotton_prod_ts_fc <- cotton_prod_ts_fit %>% forecast(h = 10)
cotton_prod_ts_fc
# A fable: 30 x 4 [1Y]
# Key:     .model [3]
   .model  Year           Value     .mean
   <chr>  <dbl>          <dist>     <dbl>
 1 Naive   2022 t(N(17, 0.046)) 18025228.
 2 Naive   2023 t(N(17, 0.091)) 18426456.
 3 Naive   2024  t(N(17, 0.14)) 18827684.
 4 Naive   2025  t(N(17, 0.18)) 19228912.
 5 Naive   2026  t(N(17, 0.23)) 19630139.
 6 Naive   2027  t(N(17, 0.27)) 20031367.
 7 Naive   2028  t(N(17, 0.32)) 20432595.
 8 Naive   2029  t(N(17, 0.36)) 20833823.
 9 Naive   2030  t(N(17, 0.41)) 21235051.
10 Naive   2031  t(N(17, 0.46)) 21636279.
# ... with 20 more rows
cotton_prod_ts_fc %>%
  autoplot(cotton_prod_ts, level = NULL) +
  labs(title = "Cotton Production Forecast",x="Year",y="Bales") +
   guides(colour = guide_legend(title = "Forecast"))

Differencing

We first use unitroot_ndiffs, which indicates that one difference is needed; after differencing, it confirms that no further differencing is required.

cotton_prod_ts %>%
  features(log_value,unitroot_ndiffs)
# A tibble: 1 x 1
  ndiffs
   <int>
1      1
cotton_prod_ts_diff <- cotton_prod_ts %>% 
 mutate(d_Value = difference(log_value))  
cotton_prod_ts_diff %>%
  features(d_Value,unitroot_ndiffs)
# A tibble: 1 x 1
  ndiffs
   <int>
1      0
cotton_prod_ts_diff %>% autoplot(d_Value)

Finding an ARIMA model

Here, we are trying to determine the best ARIMA model to forecast. There is a significant spike in the ACF at lag 1 and another in the PACF at lag 2, which suggests an ARIMA(1,1,2) as a reasonable candidate.

cotton_prod_ts_diff %>% 
  gg_tsdisplay(d_Value, plot_type='partial',lag=48) + labs(title="ACF and PACF Plots: Cotton Production", y="Differenced Log Production")

#ARIMA(1,1,2)
prod_fitmine <- cotton_prod_ts_diff %>%
  model(model = ARIMA(log_value ~ pdq(1,1,2)))
report(prod_fitmine)
Series: log_value 
Model: ARIMA(1,1,2) w/ drift 

Coefficients:
         ar1      ma1      ma2  constant
      0.1688  -0.6723  -0.0138    0.0104
s.e.  0.5922   0.5958   0.3568    0.0049

sigma^2 estimated as 0.03687:  log likelihood=37.67
AIC=-65.34   AICc=-64.94   BIC=-50.12
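Reading orders off the ACF and PACF is a judgment call, so it can help to fit a few candidate orders alongside fable's automatic search and compare AICc. A sketch on a synthetic random-walk series (the toy data is invented; substitute cotton_prod_ts to check the model above):

```r
library(fpp3)

# Synthetic non-seasonal annual series (illustrative values only)
set.seed(7)
toy <- tsibble(Year = 1950:2020,
               Value = cumsum(rnorm(71, mean = 0.02, sd = 0.2)),
               index = Year)

# Fit candidate orders plus an automatic search, then compare AICc
fits <- toy %>%
  model(m112 = ARIMA(Value ~ pdq(1, 1, 2)),
        m011 = ARIMA(Value ~ pdq(0, 1, 1)),
        auto = ARIMA(Value))
glance(fits) %>% select(.model, AICc)
```

Lower AICc is better; ties this close usually mean the simpler model is preferable.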
Forecasting with ARIMA

After logarithmically transforming the observations and differencing once, we forecast using the ARIMA(1,1,2) model. The forecasted observations show that cotton production is likely to remain steady for the next five years, with only a very slight upward trend.

cotton_prod_ts %>%
model(model = ARIMA(log(Value) ~ pdq(1,1,2))) %>%
  forecast(h=5) %>%
  autoplot(cotton_prod_ts_00) + labs(x="Year",y="Bales",title="United States Forecasted Cotton Production")

Residuals

Using the Ljung-Box test, we see that the p-value is larger than .05, meaning we fail to reject the null hypothesis that the residual autocorrelations r_k are jointly zero (i.e., that the residuals are white noise) at the 5% significance level. Similarly, in the ACF chart of the residuals, none of the autocorrelations are significantly different from 0. According to these diagnostics, the residuals are white noise.

Box.test(augment(prod_fitmine)$.resid, lag=10,type="Lj")

    Box-Ljung test

data:  augment(prod_fitmine)$.resid
X-squared = 1.9823, df = 10, p-value = 0.9965
prod_fitmine %>%
  gg_tsresiduals()
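To build intuition for what the Ljung-Box test is checking, compare its output on simulated white noise versus a clearly autocorrelated series (the seed and sample size below are arbitrary):

```r
set.seed(123)

# White noise: the test should fail to reject (large p-value)
wn <- rnorm(200)
Box.test(wn, lag = 10, type = "Ljung-Box")$p.value

# AR(1) series with strong autocorrelation: the test should reject
ac <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))
Box.test(ac, lag = 10, type = "Ljung-Box")$p.value
```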

Yield

About the Data

Looking at the plot of United States Cotton Yields, we see a steady trend until the 1930s. From there, U.S. cotton yields increased sharply, and we see a sustained upward trend. Yield is an important measure of agricultural productivity. Cotton yields have increased globally due to sustained gains in farming efficiency.
What happened in the 1930s? Plows, planters, mechanical cultivators, and harvesters were developed. Plows were mounted directly to tractors, crop planters got faster and more accurate, mechanical cultivators allowed farmers to drive through closer spaced rows, and the first wheat combine was invented in 1935 and other crop harvesters followed closely behind.

autoplot(cotton_yield_ts) + labs(title = "United States Cotton Yield",x="Year",y="Pounds per Acre")

United States Cotton Yields from 2000 to present

Pounds of cotton harvested per acre has changed by +34% over the past twenty years. Crop yields increase due to sustained gains in farming efficiency through fertilizer application, irrigation, soil tillage, and overall improved farming practices.

cotton_yield_ts_00 <- cotton_yield_ts %>%
  filter(Year>=2000)
autoplot(cotton_yield_ts_00) + labs(title="United States Cotton Yields: 2000 to 2021",x="Year",y="Pounds per Acre")

Transforming the data

I tried a Box-Cox transformation and a logarithmic transformation of the data. The remainder component of the STL decomposition is smallest for the logarithmic transformation, so I will be using that going forward as I did for the cotton production data.

lambda_cottonyield <- cotton_yield_ts %>%
  features(Value, features = guerrero) %>%
  pull(lambda_guerrero)
cotton_yield_ts %>%
  autoplot(box_cox(Value, lambda_cottonyield))

cotton_yield_ts %>%
  model(STL(box_cox(Value, lambda_cottonyield))) %>%
  components() %>% autoplot()

cotton_yield_ts %>%
  model(STL(log(Value))) %>%
  components() %>% autoplot()

cotton_yield_ts %>%
  model(STL(Value)) %>%
  components() %>% autoplot()

cotton_yield_ts <- cotton_yield_ts %>%
  mutate(log_value = log(Value))
Plotting the transformed values

After plotting the logarithmically transformed yield data, we see the same stable trend until the 1930s, then a sustained upward trend.

cotton_yield_ts %>%
  autoplot(log(Value)) + labs(title="United States Cotton Yields",x="Year",y="Pounds per Acre (Logarithmically Transformed)")

Basic forecasting

Using the basic naive, mean, and drift forecasting models, we see that cotton yield is forecasted to increase over the next 10 years according to the drift and naive models. According to the mean model, cotton yield will stay around 375 pounds per acre.

cotton_yield_ts_fit <- cotton_yield_ts %>%
  model(
        `Naive` = NAIVE(log(Value)),
          Mean = MEAN(log(Value)),
          Drift = RW(log(Value) ~ drift()))
# Generate forecasts for 10 years
cotton_yield_ts_fc <- cotton_yield_ts_fit %>% forecast(h =10)
cotton_yield_ts_fc %>%
  autoplot(cotton_yield_ts, level = NULL) +
  labs(title = "Cotton Yield Forecast", x="Year",y="Pounds per Acre") +
   guides(colour = guide_legend(title = "Forecast"))

Differencing

We first use unitroot_ndiffs, which indicates that one difference is needed; after differencing, it confirms that no further differencing is required.

cotton_yield_ts %>%
  features(log_value,unitroot_ndiffs)
# A tibble: 1 x 1
  ndiffs
   <int>
1      1
cotton_yield_ts_diff <- cotton_yield_ts %>% 
 mutate(d_Value = difference(log_value))  
cotton_yield_ts_diff %>%
  features(d_Value,unitroot_ndiffs)
# A tibble: 1 x 1
  ndiffs
   <int>
1      0
cotton_yield_ts_diff %>% autoplot(d_Value)

Finding an ARIMA model

Here, we are trying to determine the best ARIMA model to forecast. There is a significant spike in the ACF at lag 1. There is also a spike in the PACF at lag 2, but it is only marginally significant. Let’s try ARIMA(1,1,2) and ARIMA(1,1,1). The AICc values are close, with ARIMA(1,1,1) slightly lower; we nevertheless proceed with the ARIMA(1,1,2) model below.

cotton_yield_ts_diff %>% 
  gg_tsdisplay(d_Value, plot_type='partial',lag=48) + labs(title="ACF and PACF Plots: Cotton Yield", y="Differenced Log Yield")

#ARIMA(1,1,1)
yield_fitmine111<- cotton_yield_ts_diff %>%
  model(model = ARIMA(log_value ~ pdq(1,1,1)))
report(yield_fitmine111)
Series: log_value 
Model: ARIMA(1,1,1) w/ drift 

Coefficients:
          ar1      ma1  constant
      -0.2072  -0.4626    0.0144
s.e.   0.1322   0.1232    0.0048

sigma^2 estimated as 0.01266:  log likelihood=119.95
AIC=-231.9   AICc=-231.64   BIC=-219.73
yield_fitmine112<- cotton_yield_ts_diff %>%
  model(model = ARIMA(log_value ~ pdq(1,1,2)))
report(yield_fitmine112)
Series: log_value 
Model: ARIMA(1,1,2) w/ drift 

Coefficients:
         ar1      ma1     ma2  constant
      0.5972  -1.2559  0.4546    0.0048
s.e.  0.3465   0.3406  0.2111    0.0018

sigma^2 estimated as 0.01279:  log likelihood=119.67
AIC=-229.35   AICc=-228.94   BIC=-214.13
Forecasting with ARIMA

After logarithmically transforming the observations and differencing once, we forecast using an ARIMA(1,1,2) model. The forecasted observations show that cotton yields are likely to continue increasing slightly over the next five years. This is similar to our forecast for production quantity.

cotton_yield_ts %>%
model(model = ARIMA(log(Value) ~ pdq(1,1,2))) %>%
  forecast(h=5) %>%
  autoplot(cotton_yield_ts_00) + labs(x="Year",y="Pounds per Acre",title="United States Forecasted Cotton Yield")

Residuals

Using the Ljung-Box test, we see that the p-value is larger than .05, meaning we fail to reject the null hypothesis that the residual autocorrelations r_k are jointly zero (i.e., that the residuals are white noise) at the 5% significance level. Similarly, in the ACF chart of the residuals, none of the autocorrelations are significantly different from 0. According to these diagnostics, the residuals are white noise.

Box.test(augment(yield_fitmine112)$.resid, lag=10,type="Lj")

    Box-Ljung test

data:  augment(yield_fitmine112)$.resid
X-squared = 5.2945, df = 10, p-value = 0.8707
yield_fitmine112 %>%
  gg_tsresiduals()

Producer Price Index (PPI) and Inflation

About the Data

United States Cotton PPI is indexed to 1982 (= 100). Below we graph both the raw PPI data and the monthly inflation rate calculated from PPI: inflation = (PPI_t / PPI_{t-1} − 1) × 100. United States Cotton PPI has risen over the long run, but there is no clear steady trend: there has been a lot of volatility over the years, with a significant recent spike beginning in January 2020. The inflation rate of cotton producer prices has remained relatively stable, apart from a huge spike in the mid-1980s; there was widespread inflation across many industries at the time, so this makes sense. Going forward, I will be analyzing only inflation.

autoplot(ppi_cotton_ts,Value) + labs(title = "United States Cotton Producer Price Index",x="Year",y="PPI")

autoplot(ppi_cotton_ts,inflation) + labs(title = "United States Cotton Producer Price Index",x="Year",y="Inflation (Percent)")
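The inflation calculation above can be verified on a couple of hypothetical PPI values (the numbers are made up to give round results):

```r
library(dplyr)

# Hypothetical PPI readings for three consecutive months
ppi <- tibble(month = c("Jan", "Feb", "Mar"),
              Value = c(100, 102, 99.96))

# Month-over-month inflation: (PPI_t / PPI_{t-1} - 1) * 100
ppi %>% mutate(inflation = (Value / lag(Value) - 1) * 100)
```

Here February’s inflation is 2% and March’s is -2%; the first month has no predecessor, which is why the code earlier in this section replaces the initial NA with 0.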

United States Cotton Inflation from 2000 to present

We can see that cotton inflation in the U.S. has hovered around the same level over the past couple of decades, albeit with considerable volatility. PPI is considered a “forward-looking” measure of inflation: since PPI measures price changes before they reach consumers, it can be an earlier predictor of inflation than CPI. This extends to the inflation percentage calculated from PPI.

ppi_cotton_ts_00 <- ppi_cotton_ts %>%
  filter(year(month)>=2000)
autoplot(ppi_cotton_ts_00,inflation) + labs(x = "Year",y="Inflation (Percent)",title = "United States Cotton Inflation: 2000 to Present")

Transforming the data

I tried a Box-Cox transformation of the data. A logarithmic transformation is not possible for the inflation data because the series contains zero (and negative) values. The remainder component of the STL decomposition is smallest for the Box-Cox transformation, so I will be using that going forward.
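The point about the logarithm can be seen directly: log is undefined at zero, whereas the Box-Cox transform with a positive lambda is not (lambda = 0.5 below is arbitrary, chosen only for illustration):

```r
# log() blows up at zero
log(0)                    # -Inf

# Box-Cox with lambda > 0 remains finite at zero: (0^lambda - 1) / lambda
lambda <- 0.5
(0^lambda - 1) / lambda   # -1 / lambda = -2
```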

#Inflation
lambda_inflationcotton <- ppi_cotton_ts %>%
  features(inflation, features = guerrero) %>%
  pull(lambda_guerrero)
ppi_cotton_ts %>%
  autoplot(box_cox(inflation, lambda_inflationcotton))

ppi_cotton_ts %>%
  model(STL(box_cox(inflation, lambda_inflationcotton))) %>%
  components() %>% autoplot()

ppi_cotton_ts %>%
  model(STL((inflation))) %>%
  components() %>% autoplot()

ppi_cotton_ts <- ppi_cotton_ts %>%
  mutate(boxcox_value = box_cox(inflation, lambda_inflationcotton))
Plotting the transformed values

After plotting the Box-Cox transformed inflation data, we see the same stable trend.

ppi_cotton_ts %>% 
  autoplot(box_cox(inflation, lambda_inflationcotton)) + labs(title="United States Cotton Producer Price Index",x="Year",y="Inflation (Percent)")

Basic forecasting

For cotton inflation, the drift and naive models predict that cotton inflation will decrease drastically over the next 5 years. This is surprising, to say the least, since it implies inflation dropping lower than it ever has before. The seasonal naive model forecasts a similar cyclical pattern that grows over time, and the mean model predicts that inflation will hover just above 0 for the next 5 years.
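The drift forecast’s behaviour follows from its definition: it extends the line joining the first and last observations, so the point forecast is the last value plus h times the average historical change. A base-R sketch with made-up numbers:

```r
# Drift method: yhat_{T+h} = y_T + h * (y_T - y_1) / (T - 1)
y <- c(5, 3, 8, 2, 4)             # illustrative series
n <- length(y)
slope <- (y[n] - y[1]) / (n - 1)  # average change per period: -0.25
y[n] + (1:3) * slope              # forecasts for h = 1, 2, 3
```

A series whose transformed values end below where they began will therefore be forecast to keep falling, regardless of what happened in between.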

ppi_cotton_ts_fit2 <- ppi_cotton_ts %>%
  model(`Seasonal naive` = SNAIVE(box_cox(inflation, lambda_inflationcotton)),
        `Naive` = NAIVE(box_cox(inflation,lambda_inflationcotton)),
          Mean = MEAN(box_cox(inflation,lambda_inflationcotton)),
          Drift = RW(box_cox(inflation,lambda_inflationcotton) ~ drift()))
# Generate forecasts for 5 years (60 months)
ppi_cotton_ts_fc2 <- ppi_cotton_ts_fit2 %>% forecast(h=5*12)
ppi_cotton_ts_fc2 %>%
  autoplot(ppi_cotton_ts, level = NULL) +
  labs(title = "Cotton Inflation Forecast", x = "Month", y = "Inflation (Percent)") +
  guides(colour = guide_legend(title = "Forecast"))

Differencing

Neither unitroot_ndiffs nor unitroot_nsdiffs suggests that differencing is necessary in inflation’s case.

ppi_cotton_ts %>%
  features(box_cox(inflation, lambda_inflationcotton),unitroot_ndiffs)
# A tibble: 1 x 1
  ndiffs
   <int>
1      0
ppi_cotton_ts %>%
  features(box_cox(inflation,lambda_inflationcotton),unitroot_nsdiffs)
# A tibble: 1 x 1
  nsdiffs
    <int>
1       0
Finding an ARIMA model

Here, we are trying to determine the best ARIMA model to forecast. The ACF and PACF plots look very similar: there are several significant spikes in each, but they appear only marginally significant. The automatically selected ARIMA model is an ARIMA(0,0,1); I also try an ARIMA(0,0,0). These have similar AICc values, with the auto-generated model’s AICc slightly smaller. We also try an ETS model, which performs much worse.

ppi_cotton_ts %>% 
  gg_tsdisplay(inflation, plot_type='partial',lag=48) + labs(title="ACF and PACF Plots: Cotton Inflation", y="Inflation (Percent)")

ppi_fitauto <-ppi_cotton_ts %>%
  model(auto = ARIMA(box_cox(inflation, lambda_inflationcotton) ~ pdq()))
ppi_fitmine000 <- ppi_cotton_ts %>%
  model(arima = ARIMA(box_cox(inflation,lambda_inflationcotton) ~ pdq(0,0,0)))
ppi_models <- ppi_cotton_ts %>%
  model(auto = ARIMA(box_cox(inflation, lambda_inflationcotton) ~ pdq()),
        arima = ARIMA(box_cox(inflation,lambda_inflationcotton) ~ pdq(0,0,0)),
        ets = ETS(box_cox(inflation,lambda_inflationcotton)))
report(ppi_fitauto)
Series: inflation 
Model: ARIMA(0,0,1) w/ mean 
Transformation: box_cox(inflation, lambda_inflationcotton) 

Coefficients:
         ma1  constant
      0.0607   -1.2447
s.e.  0.0408    0.2011

sigma^2 estimated as 20.46:  log likelihood=-1659.28
AIC=3324.56   AICc=3324.6   BIC=3337.58
report(ppi_fitmine000)
Series: inflation 
Model: ARIMA(0,0,0) w/ mean 
Transformation: box_cox(inflation, lambda_inflationcotton) 

Coefficients:
      constant
       -1.2446
s.e.    0.1900

sigma^2 estimated as 20.5:  log likelihood=-1660.39
AIC=3324.78   AICc=3324.8   BIC=3333.46
report(ppi_models)
# A tibble: 3 x 11
  .model sigma2 log_lik   AIC  AICc   BIC ar_roots  ma_roots    MSE  AMSE   MAE
  <chr>   <dbl>   <dbl> <dbl> <dbl> <dbl> <list>    <list>    <dbl> <dbl> <dbl>
1 auto     20.5  -1659. 3325. 3325. 3338. <cpl [0]> <cpl [1]>  NA    NA   NA   
2 arima    20.5  -1660. 3325. 3325. 3333. <cpl [0]> <cpl [0]>  NA    NA   NA   
3 ets      20.5  -2653. 5313. 5313. 5326. <NULL>    <NULL>     20.5  20.5  3.84
Forecasting with ARIMA

After transforming the observations using Box-Cox, we forecast using the automatically selected ARIMA(0,0,1) model as well as an ARIMA(0,0,0) model. Neither forecast looks convincing: there has been substantial volatility in the inflation rate over the past twenty years, but the 5-year forecasts do not reflect it. This time series seems to warrant a deeper dive.

ppi_cotton_ts %>%
model(arima = ARIMA(box_cox(inflation,lambda_inflationcotton) ~ pdq(0,0,0)),
     auto = ARIMA(box_cox(inflation,lambda_inflationcotton) ~ pdq())) %>%
  forecast(h=5*12) %>%
  autoplot(ppi_cotton_ts_00) + labs(x="Year",y="Inflation (Percent)",title="United States Forecasted Cotton Inflation")

Residuals

The following notes apply to both the auto ARIMA(0,0,1) model and the ARIMA(0,0,0) model.
Using the Ljung-Box test, we see that the p-value is greater than .05, meaning we fail to reject the null hypothesis that the residual autocorrelations r_k are jointly zero at the 5% significance level for both ARIMA models. I am a bit perplexed by the results, because some lags are significantly different from 0 in the ACF chart and the histogram does not look centered at 0. However, the Ljung-Box test shows the residuals to be white noise.

Box.test(augment(ppi_fitauto)$.resid, lag=10,type="Lj")

    Box-Ljung test

data:  augment(ppi_fitauto)$.resid
X-squared = 16.381, df = 10, p-value = 0.08924
ppi_fitauto %>%
  gg_tsresiduals()

Box.test(augment(ppi_fitmine000)$.resid, lag=10,type="Lj")

    Box-Ljung test

data:  augment(ppi_fitmine000)$.resid
X-squared = 16.643, df = 10, p-value = 0.08265
ppi_fitmine000 %>%
  gg_tsresiduals()

Conclusion

Why does agricultural forecasting, specifically cotton forecasting in terms of production quantity, yield, and PPI matter?

Production decisions are made by farmers before these three items are known. A farmer does not know nationwide production quantity, the increase in yield she will see from recent technological farming advancements, or the producer price index in the upcoming farming season. This is true for all crops, not just for cotton. Forecasting can help farmers decide how much to produce or level-set their expectations. If production is going to decline (or yield, or PPI, or inflation), a farmer would likely need to reduce costs in order to sustain operations. If production is going to increase (or yield, or PPI), a farmer would likely want to produce more in order to net gain from this increase.

In terms of government policy, agricultural subsidy and aid programs run by the USDA are interested in forecasting because it helps them also level set their expectations. How much aid should we expect to provide to farms this year? Which crops will we need to subsidize?

According to our forecasts, it is likely that yield and production quantity will not change significantly. However, the cotton PPI forecast indicates that the relative price of cotton will decline. Because production quantity and yield are not forecast to increase by much, the decline in relative price may put farmers in a difficult situation, which could create a need for federal aid and cotton subsidies. However, it is important to keep in mind that inflation is forecast to stay about the same, and while there has been volatility in the past, inflation has not been steadily trending upward or downward.

I want to take a moment to discuss the plight of the small American farmer. Looking at these charts and forecasts, it’s easy to think that our farmers are thriving due to increased PPI, stable inflation, increased production quantity, and increased yield. Unfortunately, large corporate farms are thriving while many of our small family farms are barely surviving. Technology has increased farming efficiency tremendously, but due to economies of scale, these gains mostly benefit corporate farms. I would be remiss if I did not mention our American farm families.

Overall, forecasts show that yield and production quantity will stay steady with inflation staying about the same as well.

Agricultural Crop 3: Rice (Mrunali)

Introduction

Agriculture, food, and related industries contribute about 5% of U.S. Gross Domestic Product (GDP). Rice is one of the 5 major crops produced by the US: 45-50% of the crop is exported each year, and the USA is the 5th-largest exporter of rice.

Rice production forecasts help the USDA understand if there will be a surplus or scarcity of rice, and the federal government can accordingly draft agriculture policies and award subsidies to farmers to avoid such a situation. Subsidies typically focus on stabilizing the food chain or ensuring adequate income for farmers. This federal support of specific agricultural crops determines their availability and consumption, which in turn impacts the average American’s diet and nutrition.

Rice production is analyzed and forecasted in terms of yield, quantity produced, and producer price index. For these 3 time series, we look at:

  • What does the future look like for rice production?

  • How is the latest trend different from the past?

  • What are the implications for our forecast on the agricultural policy and subsidies?

Yield

About the Data

Data description:

  • Time period: 1961 - 2020
  • Time interval: Year
  • Total observations: 60
  • No missing data
rice_yield <- readr::read_csv("FAOSTAT_data_5-3-2022_RICE_YIELD.csv") %>%
  select(Year, Value) %>%
  as_tsibble(index=Year)

rice_yield$Value <- as.numeric(rice_yield$Value)

glimpse(rice_yield)
Rows: 60
Columns: 2
$ Year  <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971~
$ Value <dbl> 38227, 41785, 44449, 45906, 47693, 48442, 50851, 49600, 48399, 5~
count_gaps(rice_yield)
# A tibble: 0 x 3
# ... with 3 variables: .from <dbl>, .to <dbl>, .n <int>
rice_yield %>% autoplot(Value) + labs(title = "Rice Yield (1961-2020)", y="Yield (in hg/ha)")

Train-Test Data Split

An 80-20 split is used for the training and test data

split_id = round(nrow(rice_yield) * 0.8)
split_year = rice_yield[split_id, 1]

rice_yield_train <- rice_yield %>%
  filter(Year <= split_year$Year)

print(paste("Training data observations = ", nrow(rice_yield_train)))
[1] "Training data observations =  48"
rice_yield_test <- rice_yield %>% 
  filter(Year > split_year$Year)

print(paste("Test data observations = ", nrow(rice_yield_test)))
[1] "Test data observations =  12"
Trend
rice_yield_train %>% 
  model(STL(Value)) %>%
  components() %>%
  autoplot() + labs(title = "STL Decomposition: Rice Yield", y="Yield (in hg/ha)")

  • Upward trend
  • No seasonality as data is yearly
  • Remainder resembles white noise
ACF and PACF plots
rice_yield_train %>% 
  gg_tsdisplay(Value, plot_type="partial", lag_max=48) + labs(title = "ACF and PACF Plots: Rice Yield", y="Yield (in hg/ha)")

rice_yield_train %>% 
  gg_tsdisplay(difference(Value), plot_type="partial", lag_max=48) + labs(title = "ACF and PACF Plots: Rice Yield with Differencing", y="Yield (in hg/ha)")

  • PACF plot has a significant spike at lag 1 only
  • ACF plot is sinusoidal, and multiple lags are significant
  • The differenced series resembles white noise, with the lags largely uncorrelated

Hence, an ETS model with a trend component, an MA(1) model with first differencing (i.e., ARIMA(0,1,1)), and an automatically selected ARIMA model are evaluated below

Model Creation and Training Accuracy
fit <- rice_yield_train %>%
  model(ets = ETS(Value),
        arima = ARIMA(Value ~ pdq(0, 1, 1)),
        auto = ARIMA(Value))

fit %>% 
  select(arima) %>%
  report()
Series: Value 
Model: ARIMA(0,1,1) w/ drift 

Coefficients:
          ma1  constant
      -0.3935  820.6476
s.e.   0.1438  229.4626

sigma^2 estimated as 6826051:  log likelihood=-435.55
AIC=877.11   AICc=877.67   BIC=882.66
fit %>%
  select(auto) %>%
  report()
Series: Value 
Model: ARIMA(0,1,1) w/ drift 

Coefficients:
          ma1  constant
      -0.3935  820.6476
s.e.   0.1438  229.4626

sigma^2 estimated as 6826051:  log likelihood=-435.55
AIC=877.11   AICc=877.67   BIC=882.66
fit %>% accuracy()
# A tibble: 3 x 10
  .model .type        ME  RMSE   MAE     MPE  MAPE  MASE RMSSE    ACF1
  <chr>  <chr>     <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>
1 ets    Training -386.  2631. 2144. -0.888   3.71 0.871 0.926 -0.0363
2 arima  Training   39.3 2530. 2098. -0.0168  3.59 0.852 0.890  0.0241
3 auto   Training   39.3 2530. 2098. -0.0168  3.59 0.852 0.890  0.0241

Based on the training accuracy, ARIMA(0, 1, 1) is the best model

Forecasts and Forecast Accuracy
fc <- forecast(fit, h=nrow(rice_yield_test)) 

fc %>% accuracy(rice_yield_test)
# A tibble: 3 x 10
  .model .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima  Test   -793. 2590. 2159. -1.02  2.63   NaN   NaN 0.370
2 auto   Test   -793. 2590. 2159. -1.02  2.63   NaN   NaN 0.370
3 ets    Test  -1381. 2929. 2519. -1.72  3.06   NaN   NaN 0.397
fc %>% 
  filter(.model %in% c("arima", "ets")) %>%
  autoplot(rice_yield) + labs(title = "Forecasts: Rice Yield", y="Yield (in hg/ha)")

Based on the forecast accuracy and plot, ARIMA(0, 1, 1) is the best model
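A side note on the NaN entries for MASE and RMSSE in the test accuracy above: those metrics scale errors by in-sample naive-forecast errors, so accuracy() needs the history before the forecast origin, which the test set alone does not contain. A minimal sketch on toy data (all object names below are invented):

```r
library(fpp3)

# Toy annual series split into training and test windows
set.seed(1)
toy   <- tsibble(Year = 2000:2019, Value = 1:20 + rnorm(20), index = Year)
train <- toy %>% filter(Year <= 2015)
test  <- toy %>% filter(Year > 2015)

fc <- train %>% model(RW(Value ~ drift())) %>% forecast(h = 4)

accuracy(fc, test)  # MASE/RMSSE are NaN: no pre-forecast history to scale by
accuracy(fc, toy)   # full series supplied: MASE/RMSSE are defined
```

Passing the full series (e.g. rice_yield instead of rice_yield_test) would fill in those columns.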

Residual Analysis
fit %>% 
  augment() %>% 
  features(.innov, box_pierce, lag = 10, dof = 2)
# A tibble: 3 x 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 arima     6.10     0.636
2 auto      6.10     0.636
3 ets       5.03     0.754
fit %>% 
  augment() %>% 
  features(.innov, ljung_box, lag = 10, dof = 2)
# A tibble: 3 x 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 arima     7.57     0.476
2 auto      7.57     0.476
3 ets       6.12     0.634
fit %>% 
  select(arima) %>%
  gg_tsresiduals() + labs(title = "Residual Analysis (ARIMA): Rice Yield")

For the ARIMA(0, 1, 1) model, the residuals are uncorrelated and centered at zero, but they are not normally distributed. Hence, we can use the point forecasts, but the prediction intervals may not be accurate

This confirms that the model is fairly good for forecasting rice yield
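Since the residuals are not normal, one option fable offers is bootstrapped prediction intervals, which resample the model’s own residuals instead of assuming normality. A sketch on a toy series with skewed innovations (the toy data is illustrative; the same forecast() arguments apply to the rice fit):

```r
library(fpp3)

# Toy trending series with skewed (non-normal) innovations
set.seed(1)
toy <- tsibble(Year = 1961:2020, Value = cumsum(rexp(60)), index = Year)

fit <- toy %>% model(arima = ARIMA(Value ~ pdq(0, 1, 1)))

# bootstrap = TRUE resamples residuals, so the intervals reflect their
# empirical distribution rather than a normal approximation
fc <- fit %>% forecast(h = 5, bootstrap = TRUE, times = 500)
fc
```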

Five-Year Forecast
fit <- rice_yield %>%
  model(arima = ARIMA(Value ~ pdq(0, 1, 1)))

fc <- forecast(fit, h=5) 

fc %>% autoplot(rice_yield) + labs(title = "Five-Year Forecast: Rice Yield", y="Yield (in hg/ha)")

fc
# A fable: 5 x 4 [1Y]
# Key:     .model [1]
  .model  Year             Value  .mean
  <chr>  <dbl>            <dist>  <dbl>
1 arima   2021 N(86242, 6897818) 86242.
2 arima   2022 N(87018, 9441395) 87018.
3 arima   2023 N(87794, 1.2e+07) 87794.
4 arima   2024 N(88570, 1.5e+07) 88570.
5 arima   2025 N(89346, 1.7e+07) 89346.

Five-Year Point Forecasts:

  • 2021: 86,242 hg/ha
  • 2022: 87,018 hg/ha
  • 2023: 87,794 hg/ha
  • 2024: 88,570 hg/ha
  • 2025: 89,346 hg/ha

The forecasts follow a continuous upward trend, with year-on-year growth of roughly 0.9% (the drift term adds a constant 776 hg/ha per year)
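The growth figure can be checked directly from the point forecasts listed above:

```r
# Point forecasts copied from the fable above (hg/ha)
fc_mean <- c(86242, 87018, 87794, 88570, 89346)

diff(fc_mean)                                      # constant increment of 776 hg/ha
round(diff(fc_mean) / head(fc_mean, -1) * 100, 2)  # roughly 0.9% per year
```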

Production Quantity

About the Data

Data description:

  • Time period: 1961 - 2020
  • Time interval: Year
  • Total observations: 60
  • No missing data
rice_produced_qty <- readr::read_csv("FAOSTAT_data_5-3-2022_RICE_PROD_QTY.csv") %>%
  select(Year, Value) %>%
  as_tsibble(index=Year)

rice_produced_qty$Value <- as.numeric(rice_produced_qty$Value)

glimpse(rice_produced_qty)
Rows: 60
Columns: 2
$ Year  <dbl> 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971~
$ Value <dbl> 2458000, 2996000, 3187000, 3319000, 3460000, 3856422, 4054142, 4~
count_gaps(rice_produced_qty)
# A tibble: 0 x 3
# ... with 3 variables: .from <dbl>, .to <dbl>, .n <int>
rice_produced_qty %>% 
  autoplot(Value) + labs(title = "Rice Production Quantity (1961-2020)", y="Quantity (in tonnes)")

Train-Test data split

An 80-20 split is used for the training and test data

split_id = round(nrow(rice_produced_qty) * 0.8)
split_year = rice_produced_qty[split_id, 1]

rice_produced_qty_train <- rice_produced_qty %>%
  filter(Year <= split_year$Year)

print(paste("Training data observations = ", nrow(rice_produced_qty_train)))
[1] "Training data observations =  48"
rice_produced_qty_test <- rice_produced_qty %>% 
  filter(Year > split_year$Year)

print(paste("Test data observations = ", nrow(rice_produced_qty_test)))
[1] "Test data observations =  12"
Trend
rice_produced_qty_train %>% 
  model(STL(Value)) %>%
  components() %>%
  autoplot() + labs(title = "STL Decomposition: Rice Production Quantity", y="Quantity (in tonnes)")

  • Upward trend
  • No seasonality as data is yearly
  • Remainder resembles white noise
ACF and PACF plots
rice_produced_qty_train %>%
  gg_tsdisplay(Value, plot_type="partial", lag_max=48) + labs(title = "ACF and PACF Plots: Rice Production Quantity", y="Quantity (in tonnes)")

rice_produced_qty_train %>%
  gg_tsdisplay(difference(Value), plot_type="partial", lag_max=48) + labs(title = "ACF and PACF Plots: Rice Production Quantity with Differencing", y="Quantity (in tonnes)")

  • PACF plot has a significant spike at lag 1 only
  • ACF plot is sinusoidal, and multiple lags are significant
  • The differenced series resembles white noise, with the lags largely uncorrelated (most significant spike for ACF plot at lag-10 and for PACF plot at lag-2)

Hence, an ETS model with a trend component, an MA(1) model with first differencing (i.e., ARIMA(0,1,1)), and an automatically selected ARIMA model are evaluated below

Model Creation and Training Accuracy
fit <- rice_produced_qty_train %>%
  model(ets = ETS(Value),
        arima = ARIMA(Value ~ pdq(0, 1, 1)),
        auto = ARIMA(Value))

fit %>% 
  select(arima) %>%
  report()
Series: Value 
Model: ARIMA(0,1,1) 

Coefficients:
          ma1
      -0.4231
s.e.   0.1333

sigma^2 estimated as 6.814e+11:  log likelihood=-706.6
AIC=1417.19   AICc=1417.47   BIC=1420.89
fit %>%
  select(auto) %>%
  report()
Series: Value 
Model: ARIMA(2,1,0) w/ drift 

Coefficients:
          ar1     ar2  constant
      -0.4328  -0.347  247905.4
s.e.   0.1355   0.133  113126.8

sigma^2 estimated as 6.251e+11:  log likelihood=-703.61
AIC=1415.22   AICc=1416.17   BIC=1422.62
fit %>% accuracy()
# A tibble: 3 x 10
  .model .type          ME    RMSE     MAE    MPE  MAPE  MASE RMSSE    ACF1
  <chr>  <chr>       <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
1 ets    Training -100992. 758747. 555280. -2.61   9.17 0.842 0.864  0.0476
2 arima  Training  241138. 808069. 596401.  3.42   9.72 0.905 0.920 -0.0691
3 auto   Training    4818. 756984. 562909. -0.539  9.10 0.854 0.862 -0.0360

Based on the training accuracy, the ARIMA(2, 1, 0) model with drift is the best model

Forecasts and Forecast Accuracy
fc <- forecast(fit, h=nrow(rice_produced_qty_test))

fc %>% accuracy(rice_produced_qty_test)
# A tibble: 3 x 10
  .model .type        ME     RMSE      MAE    MPE  MAPE  MASE RMSSE    ACF1
  <chr>  <chr>     <dbl>    <dbl>    <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>
1 arima  Test    215747.  954821.  870224.   1.33  9.13   NaN   NaN -0.510 
2 auto   Test   -677101. 1291426. 1056460.  -8.31 11.9    NaN   NaN -0.111 
3 ets    Test  -1326886. 1781347. 1553681. -15.3  17.4    NaN   NaN  0.0282
fc %>% autoplot(rice_produced_qty) + labs(title = "Forecasts: Rice Production Quantity", y="Quantity (in tonnes)")

Based on the forecast accuracy and plot, ARIMA(0, 1, 1) is the best model

Residual Analysis
fit %>% 
  augment() %>% 
  features(.innov, box_pierce, lag = 10, dof = 2)
# A tibble: 3 x 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 arima    14.0     0.0808
2 auto      9.32    0.316 
3 ets      15.2     0.0549
fit %>% 
  augment() %>% 
  features(.innov, ljung_box, lag = 10, dof = 2)
# A tibble: 3 x 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 arima     17.0    0.0300
2 auto      11.4    0.181 
3 ets       18.3    0.0193
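Both portmanteau statistics reported above can be reproduced from the sample autocorrelations: Box-Pierce uses Q = n Σ r_k², while Ljung-Box weights each term as Q* = n(n+2) Σ r_k²/(n−k). A minimal base-R sketch on an arbitrary toy series (values illustrative):

```r
x <- c(5, 3, 8, 2, 9, 4, 7, 1, 6, 2, 8, 3)       # illustrative series
n <- length(x)
K <- 3
r <- acf(x, lag.max = K, plot = FALSE)$acf[-1]   # sample autocorrelations, lags 1..K

bp <- n * sum(r^2)                               # Box-Pierce Q
lb <- n * (n + 2) * sum(r^2 / (n - 1:K))         # Ljung-Box Q*

# Both agree with the built-in implementation:
Box.test(x, lag = K, type = "Box-Pierce")$statistic
Box.test(x, lag = K, type = "Ljung-Box")$statistic
```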
fit %>% 
  select(arima) %>%
  gg_tsresiduals() + labs(title = "Residual Analysis (ARIMA): Rice Production Quantity")

For the ARIMA(0, 1, 1) model, the residuals pass the Box-Pierce test at the 5% level (though the stricter Ljung-Box test marginally rejects), are largely uncorrelated and close to mean-0 centered, but do not appear normally distributed. Hence, we may use the point forecasts, but the prediction intervals may not be accurate

Overall, the model is reasonably good for forecasting rice production quantity

Five-Year Forecast
fit <- rice_produced_qty %>%
  model(arima = ARIMA(Value ~ pdq(0, 1, 1)))

fc <- forecast(fit, h=5) 

fc %>% autoplot(rice_produced_qty) + labs(title = "Five-Year Forecast: Rice Production Quantity", y="Quantity (in tonnes)")

fc
# A fable: 5 x 4 [1Y]
# Key:     .model [1]
  .model  Year               Value     .mean
  <chr>  <dbl>              <dist>     <dbl>
1 arima   2021 N(9869139, 7.4e+11)  9869139.
2 arima   2022     N(1e+07, 8e+11)  9985716.
3 arima   2023   N(1e+07, 8.5e+11) 10102293.
4 arima   2024   N(1e+07, 9.1e+11) 10218870.
5 arima   2025   N(1e+07, 9.7e+11) 10335447.

Five-Year Point Forecasts:

  • 2021: 9,869,139 tonnes
  • 2022: 9,985,716 tonnes
  • 2023: 10,102,293 tonnes
  • 2024: 10,218,870 tonnes
  • 2025: 10,335,447 tonnes

The forecasts follow a steady upward trend, with a constant annual increment of 116,577 tonnes, i.e. year-on-year growth of roughly 1.2% that tapers slightly each year
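The growth figures can be verified directly from the point forecasts: the increments are constant in absolute terms, so the percentage growth declines slightly each year. A quick base-R check:

```r
fc <- c(9869139, 9985716, 10102293, 10218870, 10335447)  # point forecasts (tonnes)

diff(fc)                                  # constant annual increment of 116,577 tonnes
round(100 * diff(fc) / head(fc, -1), 2)   # YoY growth: ~1.2% tapering toward ~1.1%
```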

Producer Price Index (YoY)

About the Data

Data description:

  • Series ID and link: WPU01230103
  • Time period: 2010 Jan - 2022 Mar
  • Time interval: Month
  • Total observations used: 135 (the first 12 months are consumed by the YoY lag, so the usable series runs 2011 Jan - 2022 Mar)
  • No missing data
  • Month-on-Month % change series resembles a white noise series with mean close to 0 and hence cannot be used for forecasting
  • Year-on-Year % change series is used for further analysis
rice_ppi <- readr::read_csv("WPU01230103_RICE_PPI.csv") %>%
  mutate('MonthYear' = yearmonth(DATE)) %>%
  rename(Value = WPU01230103) %>%
  select(MonthYear, Value) %>%
  as_tsibble(index=MonthYear) %>%
  filter(MonthYear >= yearmonth("2010 Jan"))
  
rice_ppi$Value <- as.numeric(rice_ppi$Value)

rice_ppi <- rice_ppi %>%
  mutate(MoM = (Value - lag(Value)) / lag(Value) * 100,
    YoY = (Value - lag(Value, 12)) / lag(Value, 12) * 100) 

rice_ppi <- rice_ppi %>% drop_na()

nrow(rice_ppi)
[1] 135
glimpse(rice_ppi)
Rows: 135
Columns: 4
$ MonthYear <mth> 2011 Jan, 2011 Feb, 2011 Mar, 2011 Apr, 2011 May, 2011 Jun, ~
$ Value     <dbl> 173.6, 177.7, 175.0, 175.0, 166.9, 158.8, 170.9, 183.0, 192.~
$ MoM       <dbl> 0.7544980, 2.3617512, -1.5194147, 0.0000000, -4.6285714, -4.~
$ YoY       <dbl> -10.4231166, -8.3075335, -7.1125265, -5.7619817, -8.1452944,~
count_gaps(rice_ppi)
# A tibble: 0 x 3
# ... with 3 variables: .from <mth>, .to <mth>, .n <int>
rice_ppi %>% autoplot(Value) + labs(title = "Rice Producer Price Index (2010-2022)")

rice_ppi %>% autoplot(MoM) + labs(title = "Rice PPI Month-on-Month % Change", y="% change")

rice_ppi %>% autoplot(YoY) + labs(title = "Rice PPI Year-on-Year % Change", y="% change")

Train-Test data split

An 80-20 split is used for the training and test data

split_id = round(nrow(rice_ppi) * 0.8)
split_yearmonth = rice_ppi[split_id, 1]

rice_ppi_train <- rice_ppi %>%
  filter(MonthYear <= split_yearmonth$MonthYear)

print(paste("Training data observations = ", nrow(rice_ppi_train)))
[1] "Training data observations =  108"
rice_ppi_test <- rice_ppi %>% 
  filter(MonthYear > split_yearmonth$MonthYear)

print(paste("Test data observations = ", nrow(rice_ppi_test)))
[1] "Test data observations =  27"
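The split sizes follow from the 80-20 rule applied to the 135 usable observations:

```r
n_total <- 135                    # monthly observations after the YoY lag is dropped
n_train <- round(n_total * 0.8)   # 108 observations for training
n_test  <- n_total - n_train      # 27 observations for testing
c(n_train, n_test)
```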
Seasonality & Trend
rice_ppi_train %>% 
  model(STL(YoY)) %>%
  components() %>%
  autoplot() + labs(title = "STL Decomposition: Rice Producer Price Index (YoY)", y="% change")

rice_ppi_train %>%
  gg_season(YoY) + labs(y="", title="Seasonal Plot: Rice Producer Price Index (YoY)")

  • No seasonality
  • No overall trend (Trend patterns fluctuate with time around mean-0)
  • Remainder is mean-0 centered with non-constant variance
ACF and PACF plots
rice_ppi_train %>% 
  gg_tsdisplay(YoY, plot_type="partial", lag_max=24) + labs(title = "ACF and PACF Plots: Rice Producer Price Index (YoY)")

  • PACF plot has a significant spike at lag 1, with smaller spikes at lags 2, 10, 13
  • ACF plot is sinusoidal, and multiple lags are significant

Hence, MA(1)/MA(2) and auto ARIMA models are evaluated below

Model Creation and Training Accuracy
fit <- rice_ppi_train %>%
  model(arima = ARIMA(YoY ~ pdq(0, 1, 1:2) + PDQ(0, 0, 0)),
        auto = ARIMA(YoY))

fit %>% 
  select(arima) %>% 
  report()
Series: YoY 
Model: ARIMA(0,1,2) 

Coefficients:
         ma1     ma2
      0.4040  0.4575
s.e.  0.0875  0.1086

sigma^2 estimated as 20.08:  log likelihood=-311.57
AIC=629.13   AICc=629.36   BIC=637.15
fit %>% 
  select(auto) %>% 
  report()
Series: YoY 
Model: ARIMA(1,0,2)(2,0,0)[12] 

Coefficients:
         ar1     ma1     ma2     sar1     sar2
      0.9176  0.3446  0.3456  -0.6183  -0.1795
s.e.  0.0418  0.1061  0.1102   0.1098   0.1106

sigma^2 estimated as 13.94:  log likelihood=-296.61
AIC=605.22   AICc=606.05   BIC=621.32
fit %>% accuracy()
# A tibble: 2 x 10
  .model .type        ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
  <chr>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
1 arima  Training 0.0878  4.42  3.33   Inf   Inf 0.198 0.210 -0.0507
2 auto   Training 0.0765  3.65  2.77   Inf   Inf 0.165 0.173 -0.0123

Based on the training accuracy, ARIMA(1, 0, 2)(2, 0, 0)[12] is the best model

Forecasts and Forecast Accuracy
fc <- forecast(fit, h=nrow(rice_ppi_test)) 

fc %>% accuracy(rice_ppi_test)
# A tibble: 2 x 10
  .model .type    ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 arima  Test   1.57  6.01  5.04 -464.  592.    NaN   NaN 0.773
2 auto   Test   7.39  8.79  7.44   92.5  92.5   NaN   NaN 0.644
fc %>% autoplot(rice_ppi) + labs(title = "Forecasts: Rice Producer Price Index")

Based on the forecast accuracy and plot, ARIMA(0, 1, 2) is the better model. However, the ARIMA(1, 0, 2)(2, 0, 0)[12] model also produces good forecasts, with narrower prediction intervals.

Residual Analysis
fit %>% 
  augment() %>%
  features(.innov, box_pierce, lag = 24, dof = 2)
# A tibble: 2 x 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 arima     38.2    0.0173
2 auto      17.4    0.742 
fit %>% 
  augment() %>%
  features(.innov, ljung_box, lag = 24, dof = 2)
# A tibble: 2 x 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 arima     44.1   0.00345
2 auto      20.6   0.544  
fit %>% 
  select(auto) %>%
  gg_tsresiduals() + labs(title = "Residual Analysis (ARIMA): Rice Producer Price Index")

For the ARIMA(1, 0, 2)(2, 0, 0)[12] model, the residuals resemble white noise (both portmanteau tests fail to reject), are uncorrelated, mean-0 centered and close to normally distributed. Hence, both the point forecasts and the prediction intervals can be used with reasonable confidence

Hence, we use the ARIMA(1, 0, 2)(2, 0, 0)[12] model for forecasting.

For better forecasts, it may be advisable to experiment with non-linear models, or to conduct sub-sample analysis (separate pre-pandemic and post-pandemic time series analysis)

6-Month Forecast
fit <- rice_ppi %>%
  model(arima = ARIMA(YoY ~ pdq(1, 0, 2) + PDQ(2, 0, 0)))

fc <- forecast(fit, h=6) 

fc %>% autoplot(rice_ppi) + labs(title = "6-Month Forecast: Rice Producer Price Index (YoY)", y = "% change")

fc
# A fable: 6 x 4 [1M]
# Key:     .model [1]
  .model MonthYear        YoY .mean
  <chr>      <mth>     <dist> <dbl>
1 arima   2022 Apr  N(21, 13)  21.0
2 arima   2022 May  N(20, 32)  20.1
3 arima   2022 Jun  N(21, 60)  20.6
4 arima   2022 Jul  N(19, 84)  19.2
5 arima   2022 Aug N(17, 105)  16.8
6 arima   2022 Sep N(13, 124)  13.0

Six-Month Point Forecasts:

  • 2022 Apr: 21%
  • 2022 May: 20.1%
  • 2022 Jun: 20.6%
  • 2022 Jul: 19.2%
  • 2022 Aug: 16.8%
  • 2022 Sep: 13%

The model forecasts that the producer price index will remain above its level a year earlier, but that this YoY % change will decline over time. The highest YoY % change, 21%, is predicted for April 2022, implying a PPI of 217.8
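The implied index level follows from reversing the YoY definition, index_t = index_{t-12} × (1 + YoY_t / 100). A minimal sketch, assuming an April 2021 index of 180 (an illustrative value chosen to be consistent with the 217.8 quoted above; check the source series):

```r
yoy_pct    <- 21     # forecast YoY % change for April 2022
prev_index <- 180    # assumed April 2021 PPI level (illustrative)

prev_index * (1 + yoy_pct / 100)   # implied April 2022 index: 217.8
```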

Economic Outlook and Policy

Rice yield has increased over the years and we forecast that it will continue to increase. This can be attributed to technological advances in farming methods.

Rice production quantity has increased over the years but appears to have stagnated over the past decade. California is a major producer of rice, and the on-and-off drought since 2000 has significantly decreased rice production.

Rice producer price index has been volatile over the years with positive YoY change, but the % YoY change is predicted to decline gradually over the next 6 months.

Global demand for rice has been steadily rising and is expected to reach new highs. This opens opportunities for the government to increase rice exports and revenue. There is a growing yield with scope for more rice production. Creating policies, subsidized loans, and financial support that encourage farmers to grow rice and provide ease of exports can help generate revenue.

The government should also look at water conservation and planning policies to lessen the drought-like conditions in California, which is a key producer of rice.

Agricultural Crop 4: Avocados (Joshua)

Introduction

The avocado market was valued at an estimated 14 billion dollars as of 2021 (Statista and Shahbandeh). In the United States, avocado production is led by California, followed by Florida and Hawaii, together producing over 200,000 tons. But with demand far exceeding what the US can supply, the country has become a net importer, sourcing the majority of its supply from Mexico (AgMRC).

United States Avocado Imports and Domestic Arrivals

Avocado arrivals represent the incoming volume of avocados entering the U.S. market from all suppliers (imports + domestic production) per week. Packed and loose volumes of avocados are combined and reported as total weight (in pounds).

Source countries include: Mexico, Chile, Peru, Colombia, and the Dominican Republic. Domestic production: California

Data recorded from 2020 to 2022

#Lets load in data
imports2020 <- read.csv("Volume Data Projections - 2020.csv")
imports2021 <- read.csv("Volume Data Projections - 2021.csv")
imports2022 <- read.csv("Volume Data Projections - 2022.csv")

#merge
avo_imports <- rbind(imports2020, imports2021, imports2022)

#Remove delimited issue
avo_imports$Total.Volume <- as.numeric(gsub(",","",avo_imports$Total.Volume))

#rename total.volume
avo_imports <- rename(avo_imports, Total.imports = Total.Volume)

#Create tsibble
avo_imports_ts <- avo_imports %>%
  mutate(week_yr = yearweek(Week)) %>%
  as_tsibble(index = week_yr)

avo_imports_ts <- avo_imports_ts %>%
  filter(Status == "Actual")
#First look at data
avo_imports_ts %>%
  autoplot(Total.imports)+
  labs(title = "Total Volume of Avocados Arrival in US Market (per week)",  y= "Total Weight (in lbs)", x= "Week - Year") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

#Appears to be no trend, although there are only a few years

#Lets stabilize the variation in the data
avo_imports_ts <- avo_imports_ts %>%
  mutate(Total.imports.log = log(Total.imports))

#replot
avo_imports_ts %>%
  autoplot(Total.imports.log) +
  labs(title = "Log Total Volume of Avocados Arrivals in US Market (per week)",  y= "Log of Total Weight (in lbs)", x= "Week - Year") 

#transformed data is now slightly different
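The log transform works because it turns multiplicative (percentage) variation into additive variation: equal percentage steps become equal log differences. A small base-R illustration with made-up values:

```r
x <- c(100, 110, 121, 133.1)   # a series growing 10% per step
diff(x)                        # raw increments grow with the level: 10, 11, 12.1
diff(log(x))                   # log differences are constant: log(1.1) each step
```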

#Let's look at the seasonal plot to determine the presence of seasonality
avo_imports_ts %>%
  gg_season(Total.imports.log) +
  labs(title = "Seasonal Plot Total Volume of Avocados Arrivals in US Market (per week)",  y= "Log of Total Weight (in lbs)", x= "Week - Year") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))


#Potential models:
#ARIMA with fourier terms
#Could try SNAIVE

There appears to be no trend in the series, but the seasonal plot shows a long seasonal period. Arrivals are highest at the beginning of the year; notably, early 2021 saw the highest weekly arrival of avocados at 78 million pounds. After the US reaches its peak arrivals for the year, volumes fall to a relatively constant level starting around April and remain there until the end of the year. This pattern holds for all years; 2022 follows the same path, though only part of the year's data is available. Do note that 2022 did not reach its seasonal peak because of a temporary February ban on imports from Mexico, the US's largest supplier.

The long seasonal component, which lasts a full year before repeating, means we have to use a modified ARIMA model, also referred to as a dynamic harmonic regression, which uses Fourier terms. In addition, we can use a seasonal naive model as a simple benchmark, although it is unlikely to beat the dynamic model.
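The Fourier terms behind `fourier(K = k)` are just k pairs of sine and cosine regressors at the seasonal period (m = 52 for weekly data). A hand-rolled base-R sketch of the design matrix:

```r
m <- 52        # weekly seasonal period
K <- 3         # number of sin/cos pairs
t <- 1:104     # two years of weekly observations

# Each k contributes sin(2*pi*k*t/m) and cos(2*pi*k*t/m) as regressors
X <- do.call(cbind, lapply(1:K, function(k)
  cbind(sin(2 * pi * k * t / m), cos(2 * pi * k * t / m))))

dim(X)                       # 104 observations x 2K = 6 regressors
all.equal(X[1, ], X[53, ])   # regressors repeat with period m
```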

ARIMA Dynamic Harmonic

Model development was completed by dividing data set into a train and test set.

  • Train set: 2020 Week 1 to 2021 Week 46
  • Test set: 2021 Week 47 to end
#Need to test and find the ideal value of K for the model

#Before modeling divide datset into train set
train_imports <- avo_imports_ts %>%
  filter_index(.~ "2021 W46")

#Create model
fit_imports <- train_imports %>% 
  model(
  `K = 26` = ARIMA(Total.imports.log ~ fourier(K=26) + PDQ(0,0,0)),
  `K = 20` = ARIMA(Total.imports.log ~ fourier(K=20) + PDQ(0,0,0)),
  `K = 15` = ARIMA(Total.imports.log ~ fourier(K=15) + PDQ(0,0,0)),
  `K = 12` = ARIMA(Total.imports.log ~ fourier(K=12) + PDQ(0,0,0)),
  `K = 14` = ARIMA(Total.imports.log ~ fourier(K=14) + PDQ(0,0,0)),
  `K = 13` = ARIMA(Total.imports.log ~ fourier(K=13) + PDQ(0,0,0))
)

fit_imports %>%
  forecast(h = 24) %>%
  autoplot(avo_imports_ts, level = 95) +
  facet_wrap(vars(.model), ncol = 2) +
  guides(colour = "none", fill = "none", level = "none") +
  geom_label(
    aes(x = yearweek("2020 W27"), y = 16.5,
        label = paste0("AICc = ", format(AICc))),
    data = glance(fit_imports)
  ) +
  labs(title= "Forecast Total Volume of Avocados Arrivals in US Market (per week) ", subtitle = "Attempting different levels of fourier (K) terms",
       y="Log of Total Weight (in lbs)", x= "Week - Year")

After attempting different levels of Fourier terms, we can see that the preferred dynamic harmonic models were K = 13 and K = 12, which had the lowest AICc scores. We test these two further to see which forecasts best.
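The AICc values being compared apply a small-sample correction to AIC that grows quickly with the number of parameters k, which is why large K is penalized on a short training set: AICc = AIC + 2k(k+1)/(n − k − 1). A quick sketch with illustrative numbers (not the fitted values above):

```r
aicc <- function(aic, k, n) aic + 2 * k * (k + 1) / (n - k - 1)

n <- 98                     # illustrative training-set length
aicc(100, k = 5,  n = n)    # small correction for few parameters (~100.65)
aicc(100, k = 41, n = n)    # large correction for many terms, e.g. K = 20 (~161.5)
```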

Accuracy Test
#Zoom into the K=13 model which had the best AICc score
fit_imports %>% select("K = 13") %>% forecast(h = 24) %>% autoplot(avo_imports_ts) + 
  labs(title = "Forecast Performance Against Test Set K=13", subtitle = "Total Volume of Avocados Arrivals", y= "Log Total Weight")

#Also zoom into 2nd best AICc model score, fourier term k = 12
fit_imports %>% select("K = 12") %>% forecast(h = 24) %>% autoplot(avo_imports_ts) + 
  labs(title = "Forecast Performance Against Test Set K=12", subtitle = "Total Volume of Avocados Arrivals", y= "Log Total Weight")

#Lets run an accuracy test for both models
fit_imports %>% select("K = 13", "K = 12") %>% forecast(h = 24) %>% accuracy(avo_imports_ts)
# A tibble: 2 x 10
  .model .type      ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 K = 12 Test  -0.0706 0.285 0.203 -0.413  1.16  1.31  1.30 0.180
2 K = 13 Test  -0.0695 0.300 0.212 -0.408  1.21  1.36  1.36 0.153

According to the accuracy test, which compared each model's forecasts against the test set, the K=12 model performed slightly better, so we will use it to test for white noise.

Residual Test
#We'll go with the K=12 model as it has a slightly better accuracy score

#Plot the residuals for K=12 model
fit_imports %>% select("K = 12") %>% gg_tsresiduals() + labs(title = "K = 12 model Residuals")

#First run a Ljung-Box test to check if the model's residuals resemble white noise
fit_imports %>% select("K = 12") %>% augment() %>% 
  features(.innov, ljung_box, lag = 24, dof = 12)
# A tibble: 1 x 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 K = 12    14.5     0.269
#Good, the p-value is large (> 0.05) meaning the residuals are not distinguishable from a white noise series. The observations over time are random and independent.

After computing a Ljung-Box test for the preferred model, the p-value is large (> 0.05), meaning the residuals are not distinguishable from a white noise series. That is, the residuals are random and independent, and we can move forward with producing a forecast.

#Lets forecast 1 year ahead
avo_imports_ts %>%
  model(ARIMA(Total.imports ~ fourier(K=12) + PDQ(0,0,0))) %>%
  forecast(h = 52) %>%
  autoplot(avo_imports_ts) + 
  labs(title = "1 Year Forecast of Total Volume of Avocados Arrivals in US", subtitle = "ARIMA (2,1,1) with K = 12 Fourier terms", y= "Total Weight (in pounds)") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

After using the dynamic harmonic model with 12 Fourier terms we produced a 1-year forecast. The model does a good job of capturing 2 main features of the dataset. The first is the large increase, the largest of the year, in avocado arrivals at the beginning of the year (seen around 2020 W52). The second is the smaller second wave of imports that follows the largest shipment (seen around week 20 in both years of recorded data). That wave then gives way to a slow decline toward the end of the year, after which the whole pattern repeats.

According to the forecast, a similar but lower trajectory is expected compared to previous years in terms of the amount of avocados reaching the US market. For instance, in 2023 the US market is projected to reach its highest shipment of avocados in week 2, at about 71,500,648 pounds. The high volume of avocados arriving every year ahead of February is most likely due to the annual NFL Super Bowl. Note that this projection is much lower than in the previous 2 years.

Largely because this data is composed of imports, each country has been affected differently in terms of production levels, given the regulations that were in place during the pandemic. As this series totals all imports plus one domestic producing location (California), domestic producers can use it as an indicator of the volume of avocados entering the US market. Given the large impact the US economy has faced in returning to pre-pandemic levels, domestic producers must adjust the amount of avocados they plan on picking and sending into the US market in conjunction with analyzing the demand that will be required (something we'll examine shortly).

Furthermore, governments, in this case the United States, can use this information to gauge whether the projected arrivals of avocados will satisfy demand. If arrivals are expected to fall short, the government can lower tariffs to increase supply from exporting countries.

SNAIVE
#fit SNAIVE model
fit_totalimports_snai <- train_imports %>%
  model(`Seasonal naïve` = SNAIVE(Total.imports.log))
fc_totalimports_snai <-  fit_totalimports_snai %>%
  forecast(h = 24)
#Plot 
fc_totalimports_snai %>% autoplot(avo_imports_ts) + 
  labs(title = "Forecast Performance Against Test Set SNAIVE", subtitle = "Total Volume of Avocados Arrivals", y = "Log Total Imports")

#Also compute ljung box test
augment(fit_totalimports_snai) %>%
  features(.innov, ljung_box, lag = 24, dof = 0)
# A tibble: 1 x 3
  .model         lb_stat lb_pvalue
  <chr>            <dbl>     <dbl>
1 Seasonal naïve    31.4     0.142
#We see that the p-value is >0.05 which is good as stated before.

#accuracy test
accuracy(fc_totalimports_snai, avo_imports_ts)
# A tibble: 1 x 10
  .model         .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE   ACF1
  <chr>          <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Seasonal naïve Test  -0.157 0.308 0.204 -0.904  1.17  1.32  1.40 0.0398
#In comparison to the accuracy test of the dynamic harmonic model, the harmonic model is better as it has a lower RMSE score

#Although it is not the preferred model, we can forecast 1 year ahead to visually see how the model performs
#Forecast 1 year ahead
avo_imports_ts %>%
  model(SNAIVE(Total.imports)) %>%
  forecast(h = 52) %>%
  autoplot(avo_imports_ts) + 
  labs(title = "1 Year Forecast of Total Avocado Arrivals", subtitle = "Seasonal Naive Method") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

As explained, the ideal model for forecasting total avocado arrivals is the ARIMA (2,1,1) with K = 12 Fourier terms.

Total Avocado Sales (Non-organic Avocados in California)

Total number of avocados sold per week in the United States, including regional and city-level data. This is retail scan data coming directly from retailers' cash registers, based on sales of Hass avocados.

Some of the Regions/Cities: Los Angeles, New York, South East United States, West United States, California

Type of Avocado: Organic, Non-Organic

Data recorded from 2015 to 2018

avocado <- read.csv("avocado.csv")
#Focus on California + non-organic avocados
avo_nonorganic_ca <- avocado %>%
  filter(region == "California") 

avo_nonorganic_ca <-  avo_nonorganic_ca %>%
  filter(type == "conventional") %>%
  select(Date, Total.Volume, AveragePrice, type, year, region)
  

#Convert to tsibble
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  mutate(yearweek = yearweek(Date)) %>%
  tsibble(index = yearweek)
#First look at data
avo_nonorganic_ca %>%
  autoplot(Total.Volume) +
  labs(title = "California Total Volume of Non-Organic Avocados Sold",  y= "Number of Avocados", x= "Week - Year") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

#Variation changes overtime, lets transform to improve data
#Take log of TOTAL VOLUME
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  mutate(Total.Vol.log = log(Total.Volume)) 

#Lets replot with log
avo_nonorganic_ca %>%
  autoplot(Total.Vol.log) +
  labs(title = "Log - California Total Volume of Non-Organic Avocados Sold",  y= "Log Number of Avocados", x= "Week - Year") 
#subtle difference visually; extreme values/peaks are closer to the rest of the peaks

#Addition to autoplot lets see what features we can bring out to determine potential models
avo_nonorganic_ca %>%
  gg_season(Total.Vol.log) +
  labs(title= "Seasonal Plot of Total Volume Avocados", y= "Number of Avocados (log)", x= "Week")

#A bit messy, but each year follows a similar path: highest peak in Feb, second largest peak in March, yearly low in Oct, repeat

#Lets decompose the data
avo_nonorganic_ca %>%
  model(STL(Total.Vol.log ~ trend(window = 7) + 
              season(window = "periodic"),
            robust = TRUE)) %>%
  components() %>%
  autoplot()

#Potential models: 
#Seasonal Naive, due to the seasonal pattern that is similar across years

#Seasonal ARIMA, again we see a seasonal pattern that is worth looking into (modified, now use ARIMA but as a dynamic harmonic regression model)

From the time plot we can see that there are no signs of a trend, but rather a long seasonal pattern that repeats every year. Once we construct the seasonal plot, we can see a general flow of how avocado consumption changes with the time of year. Consumption peaks in February, around the same time as the large influx of avocado arrivals seen in the previous time series. Again, this could be an indication of the annual Super Bowl event. Consumption then declines slowly each year, reaching its low around October before rising again.

The best option for forecasting is a dynamic harmonic model with Fourier terms. In addition, we can attempt a seasonal naive model to capture the seasonality of consumption.

#Create train set for model 

train_totvol <- avo_nonorganic_ca %>%
  filter_index(.~ "2017 W31")
ARIMA Dynamic harmonic model

Model development was completed by dividing data set into a train and test set.

  • Train set: 2015 Week 1 to 2017 Week 31
  • Test set: 2017 Week 32 to end
#Test different levels of K to determine ideal number of pairs of sin and cos terms to include
fit <- train_totvol %>% 
  model(
  `K = 3` = ARIMA(Total.Vol.log ~ fourier(K=3) + PDQ(0,0,0)),
  `K = 15` = ARIMA(Total.Vol.log ~ fourier(K=15) + PDQ(0,0,0)),
  `K = 12` = ARIMA(Total.Vol.log ~ fourier(K=12) + PDQ(0,0,0)),
  `K = 6` = ARIMA(Total.Vol.log ~ fourier(K=6) + PDQ(0,0,0)),
  `K = 14` = ARIMA(Total.Vol.log ~ fourier(K=14) + PDQ(0,0,0)),
  `K = 4` = ARIMA(Total.Vol.log ~ fourier(K=4) + PDQ(0,0,0))
)

fit %>%
  forecast(h = 33) %>%
  autoplot(avo_nonorganic_ca, level = 95) +
  facet_wrap(vars(.model), ncol = 2) +
  guides(colour = "none", fill = "none", level = "none") +
  geom_label(
    aes(x = yearweek("2015 W53"), y = 16.5,
        label = paste0("AICc = ", format(AICc))),
    data = glance(fit)
  ) +
  labs(title= "Forecast Total Volume Avocados", subtitle = "Attempting different levels of fourier (K) terms",
       x="Week - Year")

#K=3 resulted in the lowest/best AICc score

The plot above tests different levels of Fourier terms. The model with the best AICc score is K = 3, followed by K = 15.

Accuracy Test
#Model with best AICc score was when the fourier term k = 3
#Lets zoom into forecast
fit %>% select("K = 3") %>% forecast(h = 33) %>% autoplot(avo_nonorganic_ca) + labs(title = "Forecast Performance Against Test Set K=3", subtitle = "Total Volume of Non-Organic Avocados Sold", y= "Log Number of Avocados")

#Lets also plot the 2nd best AICc model, fourier term k = 15
fit %>% select("K = 15") %>% forecast(h = 33) %>% autoplot(avo_nonorganic_ca) + labs(title = "Forecast Performance Against Test Set K=15", subtitle = "Total Volume of Non-Organic Avocados Sold", y= "Log Number of Avocados")

#Lets run an accuracy test for both models
fit %>% select("K = 3", "K = 15") %>% forecast(h = 33) %>% accuracy(avo_nonorganic_ca)
# A tibble: 2 x 10
  .model .type      ME  RMSE   MAE     MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1 K = 15 Test  -0.0101 0.192 0.157 -0.0775  1.01  1.14  1.09 0.571
2 K = 3  Test  -0.0198 0.193 0.160 -0.141   1.03  1.17  1.09 0.566

According to the accuracy test, both models had almost identical RMSE scores, but K=15 is slightly better.

Residual Test
#Plot residuals for both models
fit %>% select("K = 15") %>% gg_tsresiduals() + labs(title = "K = 15 model Residuals")

fit %>% select("K = 3") %>% gg_tsresiduals() + labs(title = "K = 3 model Residuals")
#Run Box-Pierce tests for both models
fit %>% select("K = 3") %>% augment() %>% 
  features(.innov, box_pierce, lag = 26, dof = 3)
# A tibble: 1 x 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 K = 3     32.1    0.0983
fit %>% select("K = 15") %>% augment() %>% 
  features(.innov, box_pierce, lag = 26, dof = 15)
# A tibble: 1 x 3
  .model bp_stat bp_pvalue
  <chr>    <dbl>     <dbl>
1 K = 15    21.9    0.0252

Given how close the AICc scores were, and how the K=15 model had a slightly better RMSE score, we'll forecast with both.

Do note that according to the Ljung-Box test, neither model's p-value was > 0.05, so we cannot conclude the residuals are indistinguishable from white noise. When we run a Box-Pierce test, the K=3 model passes with a p-value > 0.05, suggesting its residuals are indistinguishable from white noise. So preferably we go with the K=3 forecast, but we don't have to rule out the K=15 model: its ACF plot visually looks like white noise, and there may be some underlying cyclicity the model struggles to capture.
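The disagreement between the two tests near the 5% threshold is expected: the Ljung-Box statistic weights each squared autocorrelation by n(n+2)/(n−k) instead of Box-Pierce's n, so it is always at least as large and rejects more readily in small samples. A base-R check on an arbitrary autocorrelated series (values illustrative):

```r
x <- cumsum(c(1, -2, 3, -1, 2, -3, 1, 2, -1, -2, 3, 1))  # illustrative series

bp <- Box.test(x, lag = 4, type = "Box-Pierce")$statistic
lb <- Box.test(x, lag = 4, type = "Ljung-Box")$statistic

unname(lb) > unname(bp)   # TRUE: Ljung-Box is the stricter statistic
```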


#Since this data is retail we'll forecast 1 year ahead
#Forecast for K = 3 which has ARIMA (0,0,1)
avo_nonorganic_ca %>%
  model(ARIMA(Total.Volume ~ fourier(K=3) + PDQ(0,0,0))) %>%
  forecast(h = 52) %>%
  autoplot(avo_nonorganic_ca) + 
  labs(title = "1 Year Forecast of Total Non-Organic Avocado Weekly Sales", subtitle = "ARIMA (0,0,1) with K = 3 Fourier terms") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

#Forecast for K = 15 which has ARIMA (1,0,2)
avo_nonorganic_ca %>%
  model(ARIMA(Total.Volume ~ fourier(K=15) + PDQ(0,0,0))) %>%
  forecast(h = 52) %>%
  autoplot(avo_nonorganic_ca) + 
  labs(title = "1 Year Forecast of Total Non-Organic Avocado Weekly Sales", subtitle = "ARIMA (1,0,2) with K = 15 Fourier terms") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

The two models produce visually different results: K=3 gives a smooth curve, while K=15 is more detailed and looks more like the rest of the dataset. We favor the K=15 model as it captures the key February peak, which, as previously stated, sees the most consumption of any time of the year.

Let's recall that this data reflects California's avocado consumption. For example, in week 5 of February 2019 the state is projected to consume around 8,597,248 avocados, which is considerably less than in previous years (from 2015-2018 the highest weekly consumption exceeded 10 million avocados).

This dataset zooms in on a particular region of the US, which benefits producers because, as previously stated, they can adjust the volume of avocados introduced into the market. It is important to realize that producers are affected not only on the production side, whether through shortages of workers, truckers, etc., but also by the instability of the economy, which shapes how the end consumer sees the product and decides whether to spend on it. Producers benefit from demand and need avocados marketed to the correct group of consumers to increase that demand. Knowing how much a producer should expect to produce to meet demand, and how to raise that demand expectation, are proper use cases for this forecast.

Seasonal Naive
#fit SNAIVE model
fit_totalvol_sn <- train_totvol %>%
  model(`Seasonal naïve` = SNAIVE(Total.Vol.log))
fc_totalvol_sn <-  fit_totalvol_sn %>%
  forecast(h = 33)
#Plot
fc_totalvol_sn %>% autoplot(avo_nonorganic_ca) + labs(title = "Seasonal Naive Forecast Against Test Set", x="Week - Year", y="Log Total volume")

#accuracy test
fit_totalvol_sn %>% forecast(h = 33) %>% accuracy(avo_nonorganic_ca)
# A tibble: 1 x 10
  .model         .type      ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
  <chr>          <chr>   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 Seasonal naïve Test  -0.0246 0.262 0.211 -0.170  1.37  1.54  1.48 0.811
#Also compute the Ljung-Box test
augment(fit_totalvol_sn) %>%
  features(.innov, ljung_box, lag = 52, dof = 0)
# A tibble: 1 x 3
  .model         lb_stat    lb_pvalue
  <chr>            <dbl>        <dbl>
1 Seasonal naïve    130. 0.0000000127
#The seasonal naive model did not achieve a better accuracy score than the Fourier ARIMA models that were run. It also fails the Ljung-Box test (p-value well below 0.05), so its residuals are autocorrelated; we show its forecast below for comparison only
#Forecast for the seasonal naive
avo_nonorganic_ca %>%
  model(SNAIVE(Total.Volume)) %>%
  forecast(h = 52) %>%
  autoplot(avo_nonorganic_ca) + 
  labs(title = "1 Year Forecast of Total Non-Organic Avocado Weekly Sales", subtitle = "Seasonal Naive Method") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

#Visually it does a good job of capturing the large spike that happens in February, which is due to the NFL Super Bowl. But of course the model is simply taking the previous season's values and using those as the forecast

Avocado Average Price (Non-organic Avocados in California)

Part of the avocado sales dataset, the average price series gives the average price of a single avocado in select regions and cities across the United States on a weekly basis.

Dates recorded from 2015 to 2018

#Before analysis we have to convert nominal prices to real prices
#download cpi
cpi <- read.csv("CPI4.csv")

#Need to merge the monthly CPI data with my weekly data
#Basically I will match the CPI values to my weekly data according to the month & year

#create monthly column
cpi <- cpi %>%
  mutate(monthyr = yearmonth(DATE))

avo_nonorganic_ca <- avo_nonorganic_ca %>%
  mutate(monthyr = yearmonth(Date))

#merge cpi by month
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  left_join(cpi, "monthyr")


#clean dataset, delete the duplicate date column DATE
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  select(-DATE)

avo_nonorganic_ca <- rename(avo_nonorganic_ca, cpi = CPIAUCSL)

#Base year: January 2018, CPI = 248.743
#Divide the CPI of each respective date by the base CPI
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  mutate(reindex.decimal = cpi/248.743)

#Now we divide nominal price by reindex
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  mutate(realprice = AveragePrice/reindex.decimal)

#To save editing time, copy real price into AveragePrice, basically swapping the nominal price column for the real price
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  mutate(AveragePrice = realprice)

avo_nonorganic_ca %>%
  select(AveragePrice, realprice)
# A tsibble: 169 x 3 [1W]
   AveragePrice realprice yearweek
          <dbl>     <dbl>   <week>
 1        0.985     0.985 2015 W01
 2        0.975     0.975 2015 W02
 3        1.08      1.08  2015 W03
 4        1.12      1.12  2015 W04
 5        0.898     0.898 2015 W05
 6        0.962     0.962 2015 W06
 7        1.01      1.01  2015 W07
 8        1.01      1.01  2015 W08
 9        0.885     0.885 2015 W09
10        1.01      1.01  2015 W10
# ... with 159 more rows
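As a sanity check, the deflation arithmetic above can be verified on toy numbers in base R (the CPI value for the earlier week and the nominal price are hypothetical):

```r
# Real price = nominal price / (CPI_t / CPI_base), with January 2018 as the base.
cpi_base <- 248.743              # January 2018 CPI, the base used above
cpi_t    <- 236.916              # hypothetical CPI for some earlier week
nominal  <- 1.00                 # hypothetical nominal avocado price (USD)
reindex  <- cpi_t / cpi_base     # the reindex.decimal step above
real     <- nominal / reindex    # price restated in January 2018 dollars
round(real, 2)                   # 1.05: the avocado costs more in real terms
```

A dollar spent when the CPI was lower bought more, so deflating to a later base year scales earlier nominal prices up.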
#First look at data
avo_nonorganic_ca %>%
  autoplot(AveragePrice) +
  labs(title = "California Real Average Price of Single Non-Organic Avocado", subtitle = "Price in January 2018 Prices",  y= "Price of an Avocado (USD)", x= "Week - Year") 

#Variation changes over time, let's transform to stabilize the data
#Take log of avg price
avo_nonorganic_ca <- avo_nonorganic_ca %>%
  mutate(AvgPricelog = log(AveragePrice)) 

#Lets replot with log
avo_nonorganic_ca %>%
  autoplot(AvgPricelog) +
  labs(title = "Log - California Real Average Price of Single Avocado", subtitle = "Transformed Data",  y= "Log Price of an Avocado (USD)", x= "Week - Year") 

#Looks like it helped

#Lets see if there is a seasonal period that exists
avo_nonorganic_ca %>%
  gg_season(AvgPricelog) +
  labs(title= "Seasonal Plot of Real Average Price of an Avocado in California", y= "Average Price (log)", x= "Week")


#Lets decompose the data
avo_nonorganic_ca %>%
  model(STL(AvgPricelog ~ trend(window = 7) + 
              season(window = "periodic"),
            robust = TRUE)) %>%
  components() %>%
  autoplot()

#Looks like there is no trend. Long seasonality seems to be present: the year starts with a steep drop around February, then prices rise until October to reach the highest price of the year, drop afterwards until the end of the year, and the pattern repeats.

#Long seasonality means we have to rely on a ARIMA model that uses fourier terms
#Similar to the other series we can also attempt a SNAIVE model

Based on the time series plot there is no clear trend. Long seasonality seems to be present: the year starts with a steep drop around February, and prices then rise persistently up until October, reaching the highest price of the year. 2016 and 2017 clearly display this persistence. Prices then fall persistently for the rest of the year, and the pattern repeats the following year.

The long seasonality means the most suitable model is an ARIMA that uses Fourier terms. As with the other series, we can also attempt an SNAIVE model.

#Create the training set before fitting any model
train_avgprice <- avo_nonorganic_ca %>%
  filter_index(.~ "2017 W31")
ARIMA Dynamic Harmonic Regression

Model development began by dividing the dataset into a train and a test set.

  • Train set: 2015 Week 1 to 2017 Week 31
  • Test set: 2017 Week 32 to end
#Find the ideal value of K for model
fit_avgprice_dhr <- train_avgprice %>% 
  model(
  `K = 3` = ARIMA(AvgPricelog ~ fourier(K=3) + PDQ(0,0,0)),
  `K = 11` = ARIMA(AvgPricelog ~ fourier(K=11) + PDQ(0,0,0)),
  `K = 5` = ARIMA(AvgPricelog ~ fourier(K=5) + PDQ(0,0,0)),
  `K = 1` = ARIMA(AvgPricelog ~ fourier(K=1) + PDQ(0,0,0)),
  `K = 2` = ARIMA(AvgPricelog ~ fourier(K=2) + PDQ(0,0,0)),
  `K = 4` = ARIMA(AvgPricelog ~ fourier(K=4) + PDQ(0,0,0))
)

fit_avgprice_dhr %>%
  forecast(h = 33) %>%
  autoplot(avo_nonorganic_ca, level = 95) +
  facet_wrap(vars(.model), ncol = 2) +
  guides(colour = "none", fill = "none", level = "none") +
  geom_label(
    aes(x = yearweek("2015 W53"), y = -0.4,
        label = paste0("AICc = ", format(AICc))),
    data = glance(fit_avgprice_dhr)
  ) +
  labs(title= "Forecast of Real Average Price of Avocado in California ", subtitle = "Attempting different levels of fourier (K) terms",
       y="Log Average Price (USD)")

When testing multiple Fourier levels we found that the K = 3 and K = 5 models had the best potential to produce a good forecast, as they had the best AICc scores.

Accuracy Test
#The model with the best AICc score was K = 3
#We can also zoom into the model
fit_avgprice_dhr %>% select("K = 3") %>% forecast(h = 33) %>% autoplot(avo_nonorganic_ca) + labs(title = "ARIMA K = 3 against test set")

#Plot the 2nd-best AICc model, Fourier term K = 5
fit_avgprice_dhr %>% select("K = 5") %>% forecast(h = 33) %>% autoplot(avo_nonorganic_ca) + labs(title = "ARIMA K = 5 against test set")

#Lets run an accuracy test for both models
fit_avgprice_dhr %>% select("K = 3", "K = 5") %>% forecast(h = 33) %>% accuracy(avo_nonorganic_ca)
# A tibble: 2 x 10
  .model .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 K = 3  Test  0.0345 0.170 0.151 -16.1  146. 0.919 0.818 0.655
2 K = 5  Test  0.0414 0.175 0.147 -25.4  137. 0.896 0.839 0.650

According to the accuracy test, the K = 3 model remains the better choice based on its RMSE score.
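The RMSE comparison behind that choice can be sketched in base R; the series and forecasts below are made up purely for illustration:

```r
# RMSE = square root of the mean squared forecast error; lower is better.
rmse <- function(actual, fc) sqrt(mean((actual - fc)^2))

actual <- c(1.00, 1.05, 0.98, 1.10)   # hypothetical test-set log prices
fc_k3  <- c(1.02, 1.04, 1.00, 1.07)   # hypothetical "K = 3" forecasts
fc_k5  <- c(0.95, 1.10, 0.90, 1.20)   # hypothetical "K = 5" forecasts

rmse(actual, fc_k3) < rmse(actual, fc_k5)   # TRUE: prefer the lower-RMSE model
```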

Residual Test
#plot residuals
fit_avgprice_dhr %>% select("K = 3") %>% gg_tsresiduals() + labs(title = "K = 3 model Residuals")

#Let's run a Ljung-Box test on the K = 3 model to examine whether the residuals resemble white noise

fit_avgprice_dhr %>% select("K = 3") %>% augment() %>% 
  features(.innov, ljung_box, lag = 26, dof = 3)
# A tibble: 1 x 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 K = 3     26.7     0.271

The model passes the test with a p-value > 0.05, which means the residuals from the model are not distinguishable from white noise, so we can use this model to forecast.
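For intuition, the same test can be run on simulated white noise with base R's `Box.test` (the data here is synthetic, not the model's actual residuals):

```r
# Ljung-Box test on simulated white noise: truly uncorrelated residual-like data.
set.seed(123)
wn <- rnorm(200)                                   # stand-in for model residuals
lb <- Box.test(wn, lag = 26, type = "Ljung-Box")   # same lag as used above
lb$p.value                                         # usually large for white noise
```

A p-value above 0.05 fails to reject the white-noise null, which is the outcome we want from a well-specified model's residuals.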

#The best model was the ARIMA(1,1,1) with K = 3 Fourier terms

#Since this data is retail we'll forecast 1 year ahead
#Forecast for K = 3 which has ARIMA (1,1,1)
avo_nonorganic_ca %>%
  model(ARIMA(AveragePrice ~ fourier(K=3) + PDQ(0,0,0))) %>%
  forecast(h = 52) %>%
  autoplot(avo_nonorganic_ca) + 
  labs(title = "1 Year Forecast of Average Non-Organic Avocado price for CA", subtitle = "Price in January 2018 Prices, ARIMA (1,1,1) with K = 3 Fourier terms", y= "Average Price (USD)") + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Due to the low number of Fourier terms, the forecast is a smooth curve. Although it does not replicate the large spike in price that typically takes place at the end of the year, it still indicates that the highest average price for an individual avocado is expected around that time frame. Overall, the projected price is expected to follow levels similar to 2017, though less similar to the years before that. Between weeks 35 and 42 of 2018 we see the highest projected prices for a single avocado, floating around $1.40, compared to the beginning of the year, where we typically see the lowest average price. For the first 3 weeks of 2019 we project the average price to be around 98 cents.

Although this is retail data and the price is ultimately dictated by the retail store, it still reflects the import volume forecasted above. At the end of the year it is a seasonal feature to see a dip in avocado arrivals, which leads to a spike in price due to the shortage. The forecast helps producers manage risk by controlling both how many avocados they put out and the price they are willing to sell at. This in turn dictates the buying price that retailers pay, because the shelf price of an avocado reflects how much the retailer paid to obtain it and sell it to the end consumer. Ultimately this helps produce a healthy price at which consumers can purchase.

Seasonal Naive
#Fit the SNAIVE model
fit_avgprice_snai <- train_avgprice %>%
  model(`Seasonal naïve` = SNAIVE(AvgPricelog))
#Lets forecast against the test set
fc_avgprice_snai <- fit_avgprice_snai %>%
  forecast(h = 33)
#plot the forecast
fc_avgprice_snai %>%
  autoplot(avo_nonorganic_ca) + labs(title = "Forecast of Average Avocado Price SNAIVE", subtitle = "Against test set")

#lets run an accuracy test
accuracy(fc_avgprice_snai, avo_nonorganic_ca)
# A tibble: 1 x 10
  .model         .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
  <chr>          <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Seasonal naïve Test  0.117 0.271 0.231 -61.1  201.  1.41  1.31 0.798
#Lets also run a ljung box test
augment(fit_avgprice_snai) %>%
  features(.innov, ljung_box, lag = 26, dof = 0)
# A tibble: 1 x 3
  .model         lb_stat lb_pvalue
  <chr>            <dbl>     <dbl>
1 Seasonal naïve    262.         0
#Looks like the model's errors are autocorrelated, meaning we can't use this model: the residuals are distinguishable from white noise.

Work Cited

Ag Marketing Resource Center. “Avocados.” Copyright 2022 Agricultural Marketing Resource Center. All Rights Reserved., Ag Marketing Resource Center, 2022, www.agmrc.org/commodities-products/fruits/avocados.

“The Challenges of the Global Rice Market - Demand and Supply.” AgronoMag, 2 May 2022, https://agronomag.com/challenges-global-rice-market/.

“Cotton Sector at a Glance.” USDA ERS - Cotton Sector at a Glance, https://www.ers.usda.gov/topics/crops/cotton-wool/cotton-sector-at-a-glance/.

Faostat, https://www.fao.org/faostat/en/#data. (Data source)

FRED, https://fred.stlouisfed.org/ (Data source)

Hayes, Tara O’Neill, et al. “Primer: Agriculture Subsidies and Their Influence on the Composition of U.S. Food Supply and Consumption.” AAF, 3 Nov. 2021, https://www.americanactionforum.org/research/primer-agriculture-subsidies-and-their-influence-on-the-composition-of-u-s-food-supply-and-consumption/#:~:text=Farm%20Bill%20Overview&text=Subsidies%20for%20farmers%20averaged%20%2416,concentrated%20among%20a%20select%20few.

“Rice Sector at a Glance.” USDA ERS - Rice Sector at a Glance, https://www.ers.usda.gov/topics/crops/rice/rice-sector-at-a-glance/.

Semuels, Alana. “American Farmers Are in Crisis. Here’s Why.” Time, Time, 27 Nov. 2019, https://time.com/5736789/small-american-farmers-debt-crisis-extinction/.

“Soybeans.” OEC, https://oec.world/en/profile/hs/soybeans.

Statista, and M. Shahbandeh. “Global Avocado Market Value 2019–2025.” Statista, 7 Mar. 2022, www.statista.com/statistics/931183/global-avocado-market-value.

(Feel free to reach out for the actual data.)